id	Vol-3194/paper50
wikidataid	Q117345044
title	Annotating Protein Structures for Understanding SARS-CoV-2 Interactome
pdfUrl	https://ceur-ws.org/Vol-3194/paper50.pdf
dblpUrl	https://dblp.org/rec/conf/sebd/PuccioLPGV22
volume	Vol-3194

Annotating Protein Structures for Understanding SARS-CoV-2 Interactome

Barbara Puccio¹, Ugo Lomoio¹, Luisa Di Paola², Pietro Hiram Guzzi⁴ and Pierangelo Veltri¹
¹ Department of Surgical and Medical Sciences, Magna Graecia University, Catanzaro, Italy
² Unit of Chemical-Physics Fundamentals in Chemical Engineering, Department of Engineering, Università Campus
Bio-Medico di Roma, via Álvaro del Portillo 21, 00128 Rome, Italy

Abstract
Protein Contact Networks (PCNs) are an emerging paradigm for modelling protein structure. A common
approach to interpreting such data is through network-based analyses. It has been shown that clustering
analysis may discover allostery in PCNs. In addition, network embedding has shown good performance
in discovering hidden communities and structures in networks. SARS-CoV-2 proteins, and in particular
the S protein, have a modular structure that needs to be annotated to understand the complex mechanisms of
infection. Such annotations, in particular the highlighting of regions participating in the binding
of human ACE2 and TMPRSS2, may help the design of tailored strategies for preventing and blocking
infection. In this work, we compare some graph embedding approaches with some classical
clustering approaches for annotating protein structures. Results show that embedding may reveal
interesting structures that constitute the starting point for further analysis.

Keywords
Protein Structure, Protein Contact Network, Data Annotations

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
Contact: barbara.puccio@unicz.it (B. Puccio); ugo.lomoio@studenti.unicz.it (U. Lomoio); luisa.dipaola@unicampus.it
(L. D. Paola); hguzzi@unicz.it (P. H. Guzzi); veltri@unicz.it (P. Veltri)
ORCID: 0000-0001-5542-2997 (P. H. Guzzi); 0000-0003-2494-0294 (P. Veltri)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

1. Introduction
Proteins are polymers made of twenty different amino acids assembled into a linear
chain. The linear sequence of amino acids determines the spatial conformation of the protein
[1]. Each amino acid is characterized by the presence of a central carbon atom
(called carbon-α), a carboxyl group, an amino group and a side chain, different for each amino
acid (this chain can be hydrophobic, polar or charged). Amino acids are linked to each other by
covalent bonds called peptide bonds [2, 3].
The amino acid sequence is known as the primary structure. The secondary structure indicates
the folding of the peptide chain (i.e. of protein subsequences) resulting from the interaction between
each amino acid and its neighboring amino acids. The main types of secondary structure are
the α-helix and the β-sheet. The tertiary structure is a combination of secondary structures that forms
a complex molecular shape (3D shape). A protein in its 3D conformation is called 'native', and
this conformation is closely connected with its biological function. Finally, the quaternary structure is only
present in proteins with multiple subunits (peptide chains), which can be the same or different.
In this scenario, Protein Contact Networks (PCNs) have emerged as a relevant paradigm for the
analysis of protein molecular structures [4, 5]. A PCN is a graph built from a protein structure. A
node in a PCN represents a Cα atom of the backbone, while an edge connects two nodes whose spatial
distance lies between 4 and 8 Å.
PCN descriptors are useful to model and analyse protein functions [4]. PCNs make it possible to identify
modules in protein molecules through network spectral clustering [6, 7, 8, 9, 10], with relevant
applications in different biological contexts [11, 12].
For instance, the analysis of PCNs makes it possible to detect phenomena such as allosteric regulation. Allostery is the
ability of proteins to transmit a signal from one site to another in response to environmental
stimuli; it relates to the transmission of information across the protein from a sensor
site (or allosteric site) to the effector site (or binding site) [13, 14]. Allostery may also be studied
using experimental methods, such as X-ray or NMR structures corresponding to different activation
states, or using molecular dynamics simulations of allosteric agent binding. Such methods are usually
time and resource consuming; therefore, computational methods are needed
to detect allostery and then to annotate allosteric regions.
Starting from PCNs, it is possible to detect modules in a protein structure using clustering
algorithms. In a graph, a cluster is a group of nodes characterized by strong
intra-cluster connection (in terms of number of contacts) and weaker inter-cluster connection.
Clustering detects communities (clusters) in a graph, which corresponds directly to
module detection in protein structures. Two methods have been devised to partition PCNs
into clusters: a geometrical method, based on the k-means algorithm and spectral clustering,
and the clustering of the embedding of the network.
PCNs simplify protein analysis and help detect modules, which are essential in allosteric regulation.
On the other hand, graphs consist of a high number of nodes and links (particularly in the protein
world), so it is challenging to apply many mathematical and statistical operations to them directly. In this
situation, embeddings appear as a reasonable solution. Building on the potential of graph embeddings,
we propose a PCN analysis that applies clustering approaches to embeddings, in order to discover
allostery in PCNs and annotate protein structures.
We use PCN-Miner, a software tool implemented in the Python programming language that is able to
import proteins in the Protein Data Bank format and generate the corresponding protein contact
networks. It also offers a set of algorithms for PCN analysis [15, 16, 17, 18]. As previously
stated, in this work we focus on applying clustering algorithms to the embeddings, with
the aim of evaluating network embedding approaches in PCN analysis. Our analysis is based
on the SARS-CoV-2 spike glycoprotein, in its closed form, and some of its Variants of Concern
[11, 15, 19, 20].

2. Protein Contact Networks
A protein structure can be represented as a complex three-dimensional object, formally defined
by the coordinates in 3D space of its atoms. Despite the large availability of protein molecular
structure data, there are still many open problems regarding the relationship between protein
structures and their functions. For this reason it is necessary to define simple descriptors that
can describe protein structures with few numerical variables. Structure and function are based
on the complex network of inter-residue interactions, where residues are identified by the amino
acid sequence [21]. Therefore, residue interactions are used to define protein interaction
networks, which are in turn used to study protein functions. The
simplest way to define such a network is to represent the protein structure by means of the α-carbon
locations. The spatial position of the Cα atoms still reflects the protein backbone, and this
highlights the most important characteristics of the three-dimensional structure. Starting
from the Cα spatial distribution, a distance matrix d is evaluated where each $d_{i,j}$ represents the
Euclidean distance in 3D space between the i-th and j-th residues, defined as
$$d_{i,j} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \qquad (1)$$

where $(x_i, y_i, z_i)$ and $(x_j, y_j, z_j)$ are the Cartesian coordinates of residues i and j, respectively.
The matrix d is used to define the Protein Contact Network, a graph-based
representation of the protein structure. A graph is the most
natural structure to represent proteins: nodes (or vertices) are the protein residues and
links (or edges) between the i-th and the j-th nodes (residues) represent residue contacts. In the
graph representation there is a link between two residues i and j if their distance
$d_{i,j}$ is between 4 and 8 Å. The lower bound excludes all covalent
bonds, which are not sensitive to environmental changes (and hence to protein functionality), while the
upper bound discards weaker non-covalent bonds (which are not significant for protein functionality).
At this point, it is possible to build the adjacency matrix A, whose generic element is defined as:

$$A_{i,j} = \begin{cases} 1 & \text{if } 4 \le d_{i,j} \le 8 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
For a given ordering of the nodes, the adjacency matrix of a graph is unique. In the case of
proteins, the order of nodes (residues) corresponds to the residue sequence (primary
structure), so the corresponding network is unique: this establishes a one-to-one
correspondence between a protein and its network.
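
Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name contact_network, the nested-list representation and the toy coordinates are hypothetical and are not taken from PCN-Miner.

```python
import math

def contact_network(coords, lower=4.0, upper=8.0):
    """Return (d, A): pairwise C-alpha distances and the PCN adjacency matrix."""
    n = len(coords)
    # Equation (1): Euclidean distance between every pair of residues.
    d = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    # Equation (2): two residues are in contact if their distance lies in [lower, upper] Å.
    A = [[1 if lower <= d[i][j] <= upper else 0 for j in range(n)] for i in range(n)]
    return d, A

# Hypothetical C-alpha coordinates (in Å) for four residues, for illustration only.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
d, A = contact_network(toy_coords)
```

In this toy example only residues 0 and 2 are in contact: consecutive residues are closer than 4 Å, and the last residue is farther than 8 Å from all the others.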

3. Protein Contact Network Analysis
Graph embeddings are transformations of property graphs into a vector or a set of vectors.
An embedding should capture the graph topology, vertex-to-vertex relationships, and other relevant
information about graphs, subgraphs, and vertices. There are a few reasons why graph
embeddings are needed:
1. Graphs consist of edges and nodes. Such network relationships can only use a specific
subset of mathematics, statistics, and machine learning, while vector spaces have a richer
toolset of approaches.
2. The adjacency matrix describes connections between nodes in the graph. It is a |V| x
|V| matrix, where |V| is the number of nodes in the graph. Each column and each
row in the matrix represents a node. Non-zero values in the matrix indicate that two
nodes are connected. Using an adjacency matrix as a feature space for large graphs is
almost impossible. Imagine a graph with 1M nodes and an adjacency matrix of 1M x 1M.
Embeddings are more practical than the adjacency matrix since they pack node properties
into a vector with a smaller dimension.
3. Vector operations are simpler and faster than comparable operations on graphs.
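
The size argument in point 2 can be made concrete with a short back-of-the-envelope calculation. The storage sizes below (one byte per entry for a dense adjacency matrix, four bytes per float32 embedding coordinate) and the function names are assumptions chosen for illustration; 64 is the embedding dimension used later in this work.

```python
def dense_adjacency_bytes(n_nodes, bytes_per_entry=1):
    """Memory needed by a dense |V| x |V| adjacency matrix."""
    return n_nodes * n_nodes * bytes_per_entry

def embedding_bytes(n_nodes, dim=64, bytes_per_entry=4):
    """Memory needed to store one dim-dimensional vector per node."""
    return n_nodes * dim * bytes_per_entry

n = 1_000_000
adj = dense_adjacency_bytes(n)   # 10**12 bytes
emb = embedding_bytes(n)         # 256 * 10**6 bytes
print(f"dense adjacency matrix: {adj / 1e12:.1f} TB")
print(f"64-dimensional embedding: {emb / 1e6:.0f} MB")
```

A sparse representation would reduce the cost of the adjacency matrix, but the embedding still provides a fixed-length feature vector per node.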
The development of novel methods for encoding the structural information of a graph for
subsequent analysis is a recent research area. These methods are usually referred to as graph
representation learning or graph embedding [22, 23]. The goal of these approaches is to learn a
mapping of graph substructures (i.e. nodes or subgraphs) into points of a low-dimensional
vector space $\mathbb{R}^d$, with $d < n$, where n is the dimension of the adjacency matrix [22]. We focus here
on node-based embeddings; all these methods realise a mapping between nodes and points
of the embedding space so that geometric relationships among embedded objects reflect the
structure of the original graph. Since embeddings are points of a Euclidean space, they may
be used in other machine learning tasks (e.g. node classification) or in other graph analysis
algorithms.
Currently, there exist many algorithms and many classification attempts, which are categorised
and described in previous surveys [24, 25, 26, 27, 22, 28].
The input of representation learning algorithms is an undirected and unweighted graph
$G = (V, E)$ with its associated adjacency matrix A and a real-valued matrix of node
attributes $X \in \mathbb{R}^{m \times |V|}$. The goal of each algorithm is to map each node into a vector $z \in \mathbb{R}^d$
where $d < |V|$.
Shallow embedding methods encode each node ($v_i \in V$) into a single vector through
a simple encoding function defined as:

$$\mathrm{ENC}(v_i) = M v_i \qquad (3)$$

where M is a matrix containing the embedding vectors of all nodes and $v_i$ is an indicator vector
that selects the column of M corresponding to node $v_i$.
Each column of M encodes a node of the original graph, and the number of rows d is lower
than the number of nodes n. These embeddings were initially inspired by matrix factorization
approaches. The methods differ in the loss functions and
similarity measures they use.
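
Equation (3) amounts to a column lookup, as the following minimal sketch shows. The matrix M here is filled with random numbers as a stand-in for a learned embedding, and the sizes and the function name enc are hypothetical.

```python
import numpy as np

n_nodes, d = 5, 2                    # hypothetical sizes, with d < n_nodes
rng = np.random.default_rng(0)
M = rng.normal(size=(d, n_nodes))    # stand-in for a learned d x |V| embedding matrix

def enc(i):
    """ENC(v_i) = M v_i, where v_i is the one-hot indicator vector of node i."""
    v = np.zeros(n_nodes)
    v[i] = 1.0
    return M @ v                     # equals column i of M

z = enc(3)
```

In practice the product is never computed explicitly: the embedding of node i is simply read from column i of M.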

4. Workflow of the Experiment
In this work we focused on the comparison between a direct analysis of PCNs using network
clustering and a network embedding approach followed by clustering on the embeddings. We
used PCN-Miner, implemented in Python 3.8. It uses the SciPy and NumPy libraries
for managing matrices; management of PDB files is provided by the ProDy package; network
embedding is realised by wrapping the GEM library and clustering algorithms by CDlib;
visualisation of protein structures is made by wrapping the community edition of PyMOL. The analysis
starts from protein data. The reference database for protein structures is the Protein Data
Bank (PDB, https://www.rcsb.org/), which also defines the PDB format, a standard file format for recording
atomic coordinates. Using PDB files we obtain protein contact networks.
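
As an illustration of the first step, the sketch below extracts Cα coordinates from PDB-format text using only the Python standard library. It is not the ProDy-based code used by PCN-Miner; the function name parse_ca_coordinates is hypothetical, and the parser makes simplifying assumptions (it reads only ATOM records and ignores chains, models and alternate locations).

```python
def parse_ca_coordinates(pdb_text):
    """Extract the (x, y, z) coordinates of every C-alpha atom from PDB-format text."""
    coords = []
    for line in pdb_text.splitlines():
        # ATOM records: atom name in columns 13-16, coordinates in columns 31-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords
```

The returned list can be passed directly to a routine that computes the distance matrix of Equation (1).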
Figure 1: Workflow

The first step consists of importing a protein structure in PDB format, from which it is possible to obtain
the PCN (alternatively, a previously computed PCN can be imported directly). We then access the
analysis functionalities. On the one hand we work with network clustering; on the other we work
with network embeddings followed by clustering on the embeddings. We then compare
the results through centrality measures and the participation coefficient, computed for
each residue.
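
The participation coefficient of a residue measures how evenly its contacts are spread across clusters. The sketch below assumes the widely used definition by Guimerà and Amaral, P_i = 1 - sum_s (k_is / k_i)^2, where k_i is the degree of node i and k_is is the number of its links to cluster s; the text does not state which variant PCN-Miner implements, and the function name and toy graph are hypothetical.

```python
def participation_coefficients(A, labels):
    """Return P_i = 1 - sum_s (k_is / k_i)^2 for each node i.

    A      : adjacency matrix as a list of lists of 0/1
    labels : labels[j] is the cluster assigned to node j
    """
    clusters = sorted(set(labels))
    result = []
    for row in A:
        k_i = sum(row)                       # degree of node i
        if k_i == 0:
            result.append(0.0)               # convention chosen here for isolated nodes
            continue
        total = 0.0
        for s in clusters:
            # number of links from node i to nodes in cluster s
            k_is = sum(a for a, lab in zip(row, labels) if lab == s)
            total += (k_is / k_i) ** 2
        result.append(1.0 - total)
    return result

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = [0, 0, 0, 1, 1, 1]
P = participation_coefficients(A, labels)
```

Nodes whose contacts all fall inside their own cluster get P = 0, while the two nodes bridging the triangles get a positive value.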
We focused on four structures of the Spike protein, in its closed form: the wild type and three
variants of concern (Alpha, Delta, Omicron).
Figure 2 shows the structure of the SARS-CoV-2 spike glycoprotein (PDB code 6VXX). We first
built the protein contact network using PCN-Miner. Then we embedded the resulting graph
using the HOPE algorithm. Each node was embedded into a vector with 64 dimensions.
Finally we applied the spectral clustering algorithm. We found some interesting communities
that could be annotated as allosteric regions after verification. Figure 3 shows the structure of
the SARS-CoV-2 spike glycoprotein (closed state, PDB code 6VXX) resulting from
the clustering analysis, with soft clustering using a normalised Laplacian. The figure highlights
the communities found.
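
To give an idea of what the embedding step computes, the following is a minimal NumPy sketch of a HOPE-style embedding based on the Katz proximity matrix. It is not the GEM implementation wrapped by PCN-Miner: the function name, the decay value beta, the use of a full SVD and the toy graph are assumptions for illustration, and Katz is only one of the proximity measures HOPE can use.

```python
import numpy as np

def hope_embedding(A, dim=4, beta=0.01):
    """HOPE-style node embedding from the Katz proximity matrix.

    A    : (n, n) adjacency matrix
    dim  : total embedding size (half source vectors, half target vectors)
    beta : Katz decay; must be below 1 / spectral_radius(A)
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Katz proximity S = (I - beta*A)^-1 (beta*A)
    S = np.linalg.solve(np.eye(n) - beta * A, beta * A)
    U, sigma, Vt = np.linalg.svd(S)
    k = dim // 2
    root = np.sqrt(sigma[:k])
    source = U[:, :k] * root          # source embeddings
    target = Vt[:k, :].T * root       # target embeddings
    return np.hstack([source, target])

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
A_toy = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
Z = hope_embedding(A_toy, dim=4)      # one 4-dimensional vector per node
```

The product of the source and target embeddings approximates the proximity matrix S; the rows of Z can then be passed to a clustering algorithm.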
Figure 4 shows the structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike
protein (PDB code 7SBK). The structure is the result of the embeddings+clustering analysis, with
HOPE as the embedding algorithm.
Figure 5 shows the structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike
protein (PDB code 7SBK).
This structure is the result of the clustering analysis, with soft clustering using a normalised
Laplacian.
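
The role of the normalised Laplacian in spectral clustering can be illustrated with a simplified sketch. The code below performs a hard two-way split using the sign of the eigenvector associated with the second-smallest eigenvalue; it is not the soft clustering procedure used for the figures, and the function name and toy graph are hypothetical.

```python
import numpy as np

def spectral_bipartition(A):
    """Split a connected graph into two clusters using the normalised Laplacian."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)             # assumes there are no isolated nodes
    # Normalised Laplacian: L = I - D^(-1/2) A D^(-1/2)
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                     # eigenvector of the 2nd smallest eigenvalue
    return (fiedler > 0).astype(int)

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
A_two_triangles = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
clusters = spectral_bipartition(A_two_triangles)
```

For more than two clusters, a common choice is to take the first k eigenvectors and cluster their rows, for example with k-means.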
Figure 2: Structure of the SARS-CoV-2 spike glycoprotein (closed state) (PDB code 6VXX). This structure
is the result of the embeddings+clustering analysis, using HOPE for node embedding.

5. Conclusion
Protein Contact Networks (PCNs) are an emerging paradigm for modelling protein structure. A
common approach to interpreting such data is through network-based analyses. It has been
shown that clustering analysis may discover allostery in PCNs. In addition, network embedding
has shown good performance in discovering hidden communities and structures in networks.
We showed how an embedding-based approach is able to discover modular substructures
of the SARS-CoV-2 Spike protein. Such regions need to be annotated and then further analysed
to understand the complex mechanisms of infection. This may help the design of tailored strategies
for preventing and blocking infection.
Figure 3: Structure of the SARS-CoV-2 spike glycoprotein (closed state) (PDB code 6VXX). This structure
is the result of the clustering analysis, with soft clustering using a normalised Laplacian.

Figure 4: Structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike protein (PDB code
7SBK). This structure is the result of the embeddings+clustering analysis, with HOPE as the embedding
algorithm.
Figure 5: Structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike protein (PDB code
7SBK). This structure is the result of the clustering analysis, with soft clustering using a normalised
Laplacian.

References
[1] P. Kukic, C. Mirabello, G. Tradigo, I. Walsh, P. Veltri, G. Pollastri, Toward an accurate
prediction of inter-residue distances in proteins using 2d recursive neural networks, BMC
bioinformatics 15 (2014) 1–15.
[2] K. Grillone, C. Riillo, F. Scionti, R. Rocca, G. Tradigo, P. H. Guzzi, S. Alcaro, M. T. Di Martino,
P. Tagliaferri, P. Tassone, Non-coding rnas in cancer: platforms and strategies for investigating
the genomic “dark matter”, Journal of Experimental & Clinical Cancer Research 39
(2020) 1–19.
[3] J. C. Galicia, P. H. Guzzi, F. M. Giorgi, A. A. Khan, Predicting the response of the dental
pulp to sars-cov2 infection: a transcriptome-wide effect cross-analysis, Genes & Immunity
21 (2020) 360–363.
[4] T. Khan, I. Ghosh, Modularity in protein structures: study on all-alpha proteins, Journal
of Biomolecular Structure and Dynamics 33 (2015) 2667–2681.
[5] E. Vocaturo, P. Veltri, On the use of networks in biomedicine, in: Procedia Computer
Science, volume 110, 2017, pp. 498–503.
[6] S. Tasdighian, L. Di Paola, M. De Ruvo, P. Paci, D. Santoni, P. Palumbo, G. Mei, A. Di Venere,
A. Giuliani, Modules identification in protein structures: the topological and geometrical
solutions, Journal of chemical information and modeling 54 (2014) 159–168.
[7] S. Gu, M. Jiang, P. H. Guzzi, T. Milenković, Modeling multi-scale data via a network of
networks, Bioinformatics 38 (2022) 2544–2553.
[8] L. Di Paola, H. Hadi-Alijanvand, X. Song, G. Hu, A. Giuliani, The discovery of a putative
allosteric site in the sars-cov-2 spike protein using an integrated structural/dynamic
approach, Journal of proteome research 19 (2020) 4576–4586.
[9] P. Vizza, A. Curcio, G. Tradigo, C. Indolfi, P. Veltri, A framework for the atrial fibrillation
prediction in electrophysiological studies, Computer methods and programs in biomedicine
120 (2015) 65–76.
[10] P. Vizza, G. Tradigo, D. Mirarchi, R. B. Bossio, N. Lombardo, G. Arabia, A. Quattrone,
P. Veltri, Methodologies of speech analysis for neurodegenerative diseases evaluation,
International journal of medical informatics 122 (2019) 45–54.
[11] F. Ortuso, D. Mercatelli, P. H. Guzzi, F. M. Giorgi, Structural genetics of circulating variants
affecting the sars-cov-2 spike/human ace2 complex, Journal of Biomolecular Structure
and Dynamics (2021) 1–11.
[12] M. E. Gallo Cantafio, K. Grillone, D. Caracciolo, F. Scionti, M. Arbitrio, V. Barbieri, L. Pensabene,
P. H. Guzzi, M. T. Di Martino, From single level analysis to multi-omics integrative
approaches: a powerful strategy towards the precision oncology, High-throughput 7 (2018)
33.
[13] L. D. Paola, G. Mei, A. D. Venere, A. Giuliani, Disclosing allostery through protein contact
networks, in: Allostery, Springer, 2021, pp. 7–20.
[14] Y. Ren, A. Sarkar, P. Veltri, A. Ay, A. Dobra, T. Kahveci, Pattern discovery in multilayer
networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics 19
(2022) 741–752.
[15] P. H. Guzzi, L. Di Paola, A. Giuliani, P. Veltri, Design and development of pcn-miner: A
tool for the analysis of protein contact networks, arXiv preprint arXiv:2201.05434 (2022).
[16] G. Canino, P. H. Guzzi, G. Tradigo, A. Zhang, P. Veltri, On the analysis of diseases and their
related geographical data, IEEE journal of biomedical and health informatics 21 (2015)
228–237.
[17] P. H. Guzzi, G. Tradigo, P. Veltri, Using dual-network-analyser for communities detecting
in dual networks, BMC Bioinformatics 22 (2021).
[18] J. Kumar Das, G. Tradigo, P. Veltri, P. Guzzi, S. Roy, Data science in unveiling covid-19
pathogenesis and diagnosis: Evolutionary origin to drug repurposing, Briefings in
Bioinformatics 22 (2021) 855–872.
[19] D. Mercatelli, E. Pedace, P. Veltri, F. M. Giorgi, P. H. Guzzi, Exploiting the molecular basis
of age and gender differences in outcomes of sars-cov-2 infections, Computational and
Structural Biotechnology Journal 19 (2021) 4092–4100.
[20] E. Zumpano, A. Fuduli, E. Vocaturo, M. Avolio, Viral pneumonia images classification
by multiple instance learning: preliminary results, in: IDEAS 2021: 25th International
Database Engineering & Applications Symposium, Montreal, QC, Canada, July 14-16, 2021,
ACM, 2021, pp. 292–296. URL: https://doi.org/10.1145/3472163.3472170.
doi:10.1145/3472163.3472170.
[21] L. Di Paola, M. De Ruvo, P. Paci, D. Santoni, A. Giuliani, Protein contact networks: an
emerging paradigm in chemistry, Chemical reviews 113 (2013) 1598–1613.
[22] W. L. Hamilton, R. Ying, J. Leskovec, Representation learning on graphs: Methods and
applications, arXiv preprint arXiv:1709.05584 (2017).
[23] E. Vocaturo, E. Zumpano, L. Caroprese, Convolutional neural network techniques on x-ray
images for covid-19 classification, in: Y. Huang, L. A. Kurgan, F. Luo, X. Hu, Y. Chen, E. R.
Dougherty, A. Kloczkowski, Y. Li (Eds.), IEEE International Conference on Bioinformatics
and Biomedicine, BIBM 2021, Houston, TX, USA, December 9-12, 2021, IEEE, 2021, pp. 3113–3115.
URL: https://doi.org/10.1109/BIBM52615.2021.9669784.
doi:10.1109/BIBM52615.2021.9669784.
[24] P. Cui, X. Wang, J. Pei, W. Zhu, A survey on network embedding, IEEE Transactions on
Knowledge and Data Engineering 31 (2018) 833–852.
[25] C. Su, J. Tong, Y. Zhu, P. Cui, F. Wang, Network embedding in biomedical data science,
Briefings in bioinformatics 21 (2020) 182–197.
[26] W. Nelson, M. Zitnik, B. Wang, J. Leskovec, A. Goldenberg, R. Sharan, To embed or not:
network embedding as a paradigm in computational biology, Frontiers in genetics 10
(2019).
[27] P. Goyal, E. Ferrara, Graph embedding techniques, applications, and performance: A
survey, Knowledge-Based Systems 151 (2018) 78–94.
[28] M. Cannataro, P. H. Guzzi, A. Sarica, Data mining and life sciences applications on the
grid, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 3 (2013)
216–238.