Vol-3194/paper50
Paper
Paper | |
---|---|
description | |
id | Vol-3194/paper50 |
wikidataid | →Q117345044 |
title | Annotating Protein Structures for Understanding SARS-CoV-2 Interactome |
pdfUrl | https://ceur-ws.org/Vol-3194/paper50.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/PuccioLPGV22 |
volume | Vol-3194→Vol-3194 |
session | → |
Annotating Protein Structures for Understanding SARS-CoV-2 Interactome
Barbara Puccio1, Ugo Lomoio1, Luisa Di Paola2, Pietro Hiram Guzzi4 and Pierangelo Veltri1

1 Department of Surgical and Medical Sciences, Magna Graecia University, Catanzaro, Italy
2 Unit of Chemical-Physics Fundamentals in Chemical Engineering, Department of Engineering, Università Campus Bio-Medico di Roma, via Álvaro del Portillo 21, 00128 Rome, Italy

Abstract
Protein Contact Networks (PCNs) are an emerging paradigm for modelling protein structure. A common approach to interpreting such data is through network-based analyses. It has been shown that clustering analysis may discover allostery in PCNs. Network embedding, in turn, has shown good performance in discovering hidden communities and structures in networks. SARS-CoV-2 proteins, and in particular the S protein, have a modular structure that needs to be annotated to understand the complex mechanisms of infection. Such annotations, and in particular the highlighting of regions participating in the binding of human ACE2 and TMPRSS, may help the design of tailored strategies for preventing and blocking infection. In this work, we compare some approaches for graph embedding with some classical clustering approaches for annotating protein structures. Results show that embedding may reveal interesting structures that constitute the starting point for further analysis.

Keywords
Protein Structure, Protein Contact Network, Data Annotations

1. Introduction
Proteins are polymers made of twenty different amino acids organised to assemble a linear chain. The linear sequence of the amino acids determines the spatial conformation of proteins [1]. Each amino acid is characterized by a central carbon atom (the α-carbon), a carboxyl group, an amino group and a side chain that differs for each amino acid (this chain can be hydrophobic, polar or charged).
Amino acids are linked to each other by covalent bonds, called peptide bonds [2, 3]. The amino acid sequence is known as the primary structure. The secondary structure describes the local folding of the peptide chain (i.e. of protein subsequences), resulting from the interaction between each amino acid and its neighbouring amino acids. The main types of secondary structure are the α-helix and the β-sheet. The tertiary structure is a combination of secondary structures that forms a complex molecular shape (3D shape). A protein in its 3D conformation is called 'native', and this conformation is closely connected with its biological function. Finally, the quaternary structure is only present in proteins with multiple subunits (peptide chains), which can be the same or different.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
barbara.puccio@unicz.it (B. Puccio); ugo.lomoio@studenti.unicz.it (U. Lomoio); luisa.dipaola@unicampus.it (L. D. Paola); hguzzi@unicz.it (P. H. Guzzi); veltri@unicz.it (P. Veltri)
ORCID: 0000-0001-5542-2997 (P. H. Guzzi); 0000-0003-2494-0294 (P. Veltri)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

In such a scenario, Protein Contact Networks (PCNs) emerged as a relevant paradigm for the analysis of protein molecular structures [4, 5]. A PCN is a graph built from a protein structure. A node in a PCN represents a Cα atom of the backbone, while an edge connects two atoms whose spatial distance lies between 4 and 8 Å. PCN descriptors are useful to model and analyse protein functions [4]. PCNs make it possible to identify modules in protein molecules through network spectral clustering [6, 7, 8, 9, 10], with relevant applications in different biological contexts [11, 12].
For instance, the analysis of PCNs makes it possible to detect phenomena such as allosteric regulation. Allostery is the ability of proteins to transmit a signal from one site to another in response to environmental stimuli; it is related to the transmission of information across the protein from a sensor site (or allosteric site) to the effector site (or binding site) [13, 14]. Allostery may also be studied using wet lab methods, such as X-ray or NMR structures corresponding to different activation states, or molecular dynamics simulations of allosteric agent binding. Such methods are usually time and resource consuming, so there is a need for computational methods to detect allostery and then to annotate allosteric regions.

Starting from PCNs, it is possible to detect modules in a protein structure using a clustering approach. In a graph, a cluster is a group of nodes characterized by strong intra-cluster connection (in terms of number of contacts) and weaker inter-cluster connection. Clustering detects communities (clusters) in a graph, and this is directly comparable to the detection of modules in a protein structure. Two methods have been devised to partition PCNs into clusters: a geometrical method, based on the k-means algorithm and spectral clustering, and the clustering of the embedding of the network. PCNs simplify protein analysis and allow the detection of modules, which are essential in allosteric regulation. On the other hand, graphs consist of a high number of nodes and links (particularly for proteins), so it is challenging to apply different mathematical and statistical operations to them. In this situation, embeddings appear as a reasonable solution. Based on the potential of graph embeddings, we propose a PCN analysis using clustering approaches on embeddings, in order to discover allostery in PCNs and annotate protein structures.
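The notion of a cluster used above (many contacts inside a group, few between groups) can be made concrete by counting edges. The following is a minimal sketch and not part of PCN-Miner; the function name `count_intra_inter_edges` and the toy graph are assumptions introduced only for illustration.

```python
def count_intra_inter_edges(edges, labels):
    """Count edges inside clusters and edges crossing clusters.

    edges  : iterable of (i, j) node-index pairs of an undirected graph
    labels : dict mapping node index -> cluster id
    """
    intra = 0
    inter = 0
    for i, j in edges:
        if labels[i] == labels[j]:
            intra += 1
        else:
            inter += 1
    return intra, inter


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

intra, inter = count_intra_inter_edges(toy_edges, toy_labels)
print(intra, inter)  # 6 1
```

A partition with many intra-cluster and few inter-cluster edges, as in this toy case, is the kind of structure a clustering algorithm is expected to recover.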
We use PCN-Miner, a software tool implemented in the Python programming language that imports proteins in the Protein Data Bank format and generates the corresponding protein contact network. It also offers a set of algorithms for PCN analysis [15, 16, 17, 18]. As previously reported, in this work we focus on the application of clustering algorithms to the embeddings, with the aim of evaluating network embedding approaches in PCN analysis. Our analysis is based on the SARS-CoV-2 spike glycoprotein, in its closed form, and some of its Variants of Concern [11, 15, 19, 20].

2. Protein Contact Networks
A protein structure can be represented as a complex three-dimensional object, formally defined by the coordinates in 3D space of its atoms. Despite the large availability of protein molecular structure data, there are still many open problems regarding the relationship between protein structures and their functions. For this reason, it is necessary to define simple descriptors that can describe protein structures with few numerical variables. Structure and function are based on the complex network of inter-residue interactions, where residues are identified by amino acid sequences [21]. Therefore, residue interactions are used to define protein interaction networks, which are in turn used to study protein functions. The simplest choice to define such networks is to represent the protein structure by means of the α-carbon locations. The spatial position of the $C_\alpha$ atoms is still reminiscent of the protein backbone, and this highlights the most important characteristics of the three-dimensional structure. Starting from the $C_\alpha$ spatial distribution, a distance matrix $d$ is evaluated, where each $d_{i,j}$ represents the Euclidean distance in 3D space between the $i$-th and $j$-th residues, defined as

$$d_{i,j} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \qquad (1)$$

where $(x_i, y_i, z_i)$ and $(x_j, y_j, z_j)$ are the Cartesian coordinates of residues $i$ and $j$, respectively.
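As an illustration of Eq. (1), the sketch below computes the pairwise distance matrix from a list of Cα coordinates and then marks as contacts the pairs whose distance falls in the 4 to 8 Å range stated in the Introduction. It is a minimal, standard-library example and not the PCN-Miner implementation; the function names, the inclusive treatment of the bounds, and the coordinates are assumptions made for illustration.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances (Eq. 1) between C-alpha coordinates."""
    n = len(coords)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(coords[i], coords[j])
            d[i][j] = dist
            d[j][i] = dist  # the matrix is symmetric
    return d


def contact_map(d, lower=4.0, upper=8.0):
    """Binary contact matrix: 1 where lower <= d[i][j] <= upper, else 0.

    Inclusive bounds are an assumption of this sketch.
    """
    return [[1 if lower <= x <= upper else 0 for x in row] for row in d]


# Hypothetical C-alpha coordinates in angstroms (not taken from a PDB file).
ca_coords = [
    (0.0, 0.0, 0.0),
    (3.0, 4.0, 0.0),
    (0.0, 0.0, 10.0),
    (3.0, -0.5, 0.0),
]

d = distance_matrix(ca_coords)
a = contact_map(d)
for row in a:
    print(row)
```

In this toy example residues 0 and 1 (5 Å apart) and residues 1 and 3 (4.5 Å apart) are in contact, residues 0 and 3 are too close (about 3.04 Å), and residue 2 is farther than 8 Å from all the others. The diagonal is zero because a residue is at distance 0 from itself.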
Matrix $d$ is used to define the Protein Contact Network, a graph-based model of the protein structure. A graph is the most natural structure to represent proteins: nodes (or vertices) are the protein residues, and links (or edges) between the $i$-th and the $j$-th nodes (residues) represent residue contacts. In the graph representation, there is a link between two residues $i$ and $j$ if the distance between them (i.e., $d_{i,j}$) is between 4 and 8 Å. The lower bound excludes all covalent bonds, which are not sensitive to environmental changes (and hence to protein functionality), while the upper bound discards weaker non-covalent bonds (which are not significant for protein functionality). At this point, it is possible to build the adjacency matrix $A$, whose generic element is defined as:

$$A_{ij} = \begin{cases} 1 & \text{if } 4 \le d_{ij} \le 8 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

The adjacency matrix of a graph is unique up to the ordering of the nodes. In the case of proteins, in which the order of nodes (residues) corresponds to the residue sequence (primary structure), the corresponding network is unique: this establishes a one-to-one correspondence between a protein and its network.

3. Protein Contact Network Analysis
Graph embeddings are the transformation of property graphs to a vector or a set of vectors. An embedding should capture the graph topology, vertex-to-vertex relationships, and other relevant information about graphs, subgraphs, and vertices. There are a few reasons why graph embeddings are needed:

1. Graphs consist of edges and nodes. Those network relationships can only use a specific subset of mathematics, statistics, and machine learning, while vector spaces have a richer toolset of approaches.
2. The adjacency matrix describes connections between nodes in the graph. It is a $|V| \times |V|$ matrix, where $|V|$ is the number of nodes in the graph. Each column and each row in the matrix represents a node. Non-zero values in the matrix indicate that two nodes are connected. Using an adjacency matrix as a feature space for large graphs is almost impossible: imagine a graph with 1M nodes and an adjacency matrix of 1M x 1M. Embeddings are more practical than the adjacency matrix since they pack node properties in a vector with a smaller dimension.
3. Vector operations are simpler and faster than comparable operations on graphs.

The development of novel methods for encoding the structural information of a graph, to be used in subsequent analysis, is a recent research area. These methods are usually referred to as graph representation learning or graph embedding [22, 23]. The goal of these approaches is to learn a mapping of graph substructures (i.e. nodes or subgraphs) into points of a low-dimensional vector space $\mathbb{R}^d$, with $d < n$, where $n$ is the dimension of the adjacency matrix [22]. Here we focus on node-based embeddings: all these methods realise a mapping between nodes and points of the embedding space, so that geometric relationships among embedded objects reflect the structure of the original graph. Since embeddings are points of a Euclidean space, they may be used in other machine learning tasks (e.g. node classification) or in other graph analysis algorithms. Currently, there exist many algorithms and many classification attempts, which are categorised and described in previous surveys [24, 25, 26, 27, 22, 28].

The input of representation learning algorithms is an undirected and unweighted graph $G = (V, E)$ with its associated adjacency matrix $A$ and a real-valued matrix of node attributes $X \in \mathbb{R}^{m \times |V|}$. The goal of each algorithm is to map each node into a vector $z \in \mathbb{R}^d$, where $d < |V|$. Shallow embedding methods encode each node ($v_i \in G$) into a single vector through a simple encoding function defined as:

$$\mathrm{ENC}(v_i) = M v_i \qquad (3)$$

where $M$ is a matrix containing the embedding vectors and $v_i$ is an indicator vector used for selecting the column.
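The encoding function of Eq. (3) amounts to a column lookup: multiplying $M$ by an indicator (one-hot) vector returns the column of $M$ associated with that node. The sketch below shows this with plain Python lists; the matrix values and the helper names `one_hot` and `encode` are hypothetical and chosen only for illustration. In a real shallow embedding method the entries of $M$ would be learned, not written by hand.

```python
def one_hot(index, n):
    """Indicator vector selecting node `index` among n nodes."""
    return [1.0 if k == index else 0.0 for k in range(n)]


def encode(M, v):
    """Shallow encoder ENC(v_i) = M v_i for a d x n matrix M (list of rows)."""
    return [sum(m_rk * v_k for m_rk, v_k in zip(row, v)) for row in M]


# Hypothetical 2 x 3 embedding matrix: d = 2 dimensions, n = 3 nodes.
M = [
    [0.1, 0.5, -0.3],
    [0.7, -0.2, 0.4],
]

z1 = encode(M, one_hot(1, 3))
print(z1)  # [0.5, -0.2], i.e. column 1 of M
```

In practice the product is never computed explicitly; the column is read directly. The matrix form is useful mainly because it makes clear that the only parameters of a shallow encoder are the entries of $M$.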
The matrix $M$ contains all the embeddings. Each column of $M$ encodes a node of the original graph, and the number of rows $d$ is lower than the number of nodes $n$. These embeddings were initially inspired by matrix factorization approaches. The methods differ in the loss functions and similarity measures they use.

4. Workflow of the Experiment
In this work we focused on the comparison between a direct analysis of PCNs using network clustering and a network embedding approach followed by clustering on the embeddings. We used PCN-Miner, implemented in Python 3.8. It uses the SciPy and NumPy libraries for managing matrices; management of PDB files is provided by the ProDy package; network embedding is realised by wrapping the GEM library, and clustering algorithms by wrapping CDlib; visualisation of protein structures is made by wrapping the community edition of PyMOL. The analysis started from protein data. The reference database for protein structures is the Protein Data Bank (PDB, https://www.rcsb.org/), which also defines the PDB format, a standard for recording atom files. Using PDB files, we obtain protein contact networks.

Figure 1: Workflow

The first step consists of importing a protein structure in PDB format, from which the PCN is obtained (alternatively, a previously determined PCN can be imported directly). The analysis functionalities can then be accessed. On the one hand we work with network clustering; on the other we work with network embeddings followed by clustering on the embeddings. We then compared the results through centrality measures and the participation coefficient, computed for each residue. We focused on four structures of the Spike protein in its closed form: the wild type and three variants of concern (alpha, delta, omicron). Figure 2 shows the structure of the SARS-CoV-2 spike glycoprotein (PDB code 6VXX). We first built the protein contact network using PCN-Miner. Then we embedded the resulting graph using the HOPE algorithm.
Each node was embedded into a 64-dimensional vector. Finally, we applied the spectral clustering algorithm. We found some interesting communities that could be annotated as allosteric regions after verification. Figure 3 shows the structure of the SARS-CoV-2 spike glycoprotein (closed state) (PDB code 6VXX). This structure is the result of the clustering analysis, with soft clustering using a normalised Laplacian. The figure highlights the communities found. Figure 4 shows the structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike protein (PDB code 7SBK). The structure is the result of the embeddings+clustering analysis, with HOPE as the embedding algorithm. Figure 5 shows the structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike protein (PDB code 7SBK). This structure is the result of the clustering analysis, with soft clustering using a normalised Laplacian.

Figure 2: Structure of the SARS-CoV-2 spike glycoprotein (closed state) (PDB code 6VXX). This structure is the result of the embeddings+clustering analysis, using HOPE for node embedding.

5. Conclusion
Protein Contact Networks (PCNs) are an emerging paradigm for modelling protein structure. A common approach to interpreting such data is through network-based analyses. It has been shown that clustering analysis may discover allostery in PCNs. Network embedding, in turn, has shown good performance in discovering hidden communities and structures in networks. We showed how an embedding-based approach is able to discover modular substructures of the SARS-CoV-2 Spike protein. Such regions need to be annotated and then further analysed to understand the complex mechanisms of infection. This may help the design of tailored strategies for preventing and blocking infection.

Figure 3: Structure of the SARS-CoV-2 spike glycoprotein (closed state) (PDB code 6VXX).
This structure is the result of the clustering analysis, with soft clustering using a normalised Laplacian.

Figure 4: Structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike protein (PDB code 7SBK). This structure is the result of the embeddings+clustering analysis, with HOPE as the embedding algorithm.

Figure 5: Structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike protein (PDB code 7SBK). This structure is the result of the clustering analysis, with soft clustering using a normalised Laplacian.

References
[1] P. Kukic, C. Mirabello, G. Tradigo, I. Walsh, P. Veltri, G. Pollastri, Toward an accurate prediction of inter-residue distances in proteins using 2d recursive neural networks, BMC bioinformatics 15 (2014) 1–15.
[2] K. Grillone, C. Riillo, F. Scionti, R. Rocca, G. Tradigo, P. H. Guzzi, S. Alcaro, M. T. Di Martino, P. Tagliaferri, P. Tassone, Non-coding rnas in cancer: platforms and strategies for investigating the genomic “dark matter”, Journal of Experimental & Clinical Cancer Research 39 (2020) 1–19.
[3] J. C. Galicia, P. H. Guzzi, F. M. Giorgi, A. A. Khan, Predicting the response of the dental pulp to sars-cov2 infection: a transcriptome-wide effect cross-analysis, Genes & Immunity 21 (2020) 360–363.
[4] T. Khan, I. Ghosh, Modularity in protein structures: study on all-alpha proteins, Journal of Biomolecular Structure and Dynamics 33 (2015) 2667–2681.
[5] E. Vocaturo, P. Veltri, On the use of networks in biomedicine, in: Procedia Computer Science, volume 110, 2017, pp. 498–503.
[6] S. Tasdighian, L. Di Paola, M. De Ruvo, P. Paci, D. Santoni, P. Palumbo, G. Mei, A. Di Venere, A. Giuliani, Modules identification in protein structures: the topological and geometrical solutions, Journal of chemical information and modeling 54 (2014) 159–168.
[7] S. Gu, M. Jiang, P. H. Guzzi, T. Milenković, Modeling multi-scale data via a network of networks, Bioinformatics 38 (2022) 2544–2553.
[8] L. Di Paola, H. Hadi-Alijanvand, X. Song, G. Hu, A. Giuliani, The discovery of a putative allosteric site in the sars-cov-2 spike protein using an integrated structural/dynamic approach, Journal of proteome research 19 (2020) 4576–4586.
[9] P. Vizza, A. Curcio, G. Tradigo, C. Indolfi, P. Veltri, A framework for the atrial fibrillation prediction in electrophysiological studies, Computer methods and programs in biomedicine 120 (2015) 65–76.
[10] P. Vizza, G. Tradigo, D. Mirarchi, R. B. Bossio, N. Lombardo, G. Arabia, A. Quattrone, P. Veltri, Methodologies of speech analysis for neurodegenerative diseases evaluation, International journal of medical informatics 122 (2019) 45–54.
[11] F. Ortuso, D. Mercatelli, P. H. Guzzi, F. M. Giorgi, Structural genetics of circulating variants affecting the sars-cov-2 spike/human ace2 complex, Journal of Biomolecular Structure and Dynamics (2021) 1–11.
[12] M. E. Gallo Cantafio, K. Grillone, D. Caracciolo, F. Scionti, M. Arbitrio, V. Barbieri, L. Pensabene, P. H. Guzzi, M. T. Di Martino, From single level analysis to multi-omics integrative approaches: a powerful strategy towards the precision oncology, High-throughput 7 (2018) 33.
[13] L. D. Paola, G. Mei, A. D. Venere, A. Giuliani, Disclosing allostery through protein contact networks, in: Allostery, Springer, 2021, pp. 7–20.
[14] Y. Ren, A. Sarkar, P. Veltri, A. Ay, A. Dobra, T. Kahveci, Pattern discovery in multilayer networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics 19 (2022) 741–752.
[15] P. H. Guzzi, L. Di Paola, A. Giuliani, P. Veltri, Design and development of pcn-miner: A tool for the analysis of protein contact networks, arXiv preprint arXiv:2201.05434 (2022).
[16] G. Canino, P. H. Guzzi, G. Tradigo, A. Zhang, P. Veltri, On the analysis of diseases and their related geographical data, IEEE journal of biomedical and health informatics 21 (2015) 228–237.
[17] P. H. Guzzi, G. Tradigo, P. Veltri, Using dual-network-analyser for communities detecting in dual networks, BMC Bioinformatics 22 (2021).
[18] J. Kumar Das, G. Tradigo, P. Veltri, P. Guzzi, S. Roy, Data science in unveiling covid-19 pathogenesis and diagnosis: Evolutionary origin to drug repurposing, Briefings in Bioinformatics 22 (2021) 855–872.
[19] D. Mercatelli, E. Pedace, P. Veltri, F. M. Giorgi, P. H. Guzzi, Exploiting the molecular basis of age and gender differences in outcomes of sars-cov-2 infections, Computational and Structural Biotechnology Journal 19 (2021) 4092–4100.
[20] E. Zumpano, A. Fuduli, E. Vocaturo, M. Avolio, Viral pneumonia images classification by multiple instance learning: preliminary results, in: IDEAS 2021: 25th International Database Engineering & Applications Symposium, Montreal, QC, Canada, July 14-16, 2021, ACM, 2021, pp. 292–296. URL: https://doi.org/10.1145/3472163.3472170. doi:10.1145/3472163.3472170.
[21] L. Di Paola, M. De Ruvo, P. Paci, D. Santoni, A. Giuliani, Protein contact networks: an emerging paradigm in chemistry, Chemical reviews 113 (2013) 1598–1613.
[22] W. L. Hamilton, R. Ying, J. Leskovec, Representation learning on graphs: Methods and applications, arXiv preprint arXiv:1709.05584 (2017).
[23] E. Vocaturo, E. Zumpano, L. Caroprese, Convolutional neural network techniques on x-ray images for covid-19 classification, in: Y. Huang, L. A. Kurgan, F. Luo, X. Hu, Y. Chen, E. R. Dougherty, A. Kloczkowski, Y. Li (Eds.), IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2021, Houston, TX, USA, December 9-12, 2021, IEEE, 2021, pp. 3113–3115. URL: https://doi.org/10.1109/BIBM52615.2021.9669784. doi:10.1109/BIBM52615.2021.9669784.
[24] P. Cui, X. Wang, J. Pei, W. Zhu, A survey on network embedding, IEEE Transactions on Knowledge and Data Engineering 31 (2018) 833–852.
[25] C. Su, J. Tong, Y. Zhu, P. Cui, F. Wang, Network embedding in biomedical data science, Briefings in bioinformatics 21 (2020) 182–197.
[26] W. Nelson, M. Zitnik, B. Wang, J. Leskovec, A. Goldenberg, R. Sharan, To embed or not: network embedding as a paradigm in computational biology, Frontiers in genetics 10 (2019).
[27] P. Goyal, E. Ferrara, Graph embedding techniques, applications, and performance: A survey, Knowledge-Based Systems 151 (2018) 78–94.
[28] M. Cannataro, P. H. Guzzi, A. Sarica, Data mining and life sciences applications on the grid, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 3 (2013) 216–238.