Vol-3194/paper70


Wolfgang Fahl

Paper

Paper
edit
description  
id  Vol-3194/paper70
wikidataid  Q117344928→Q117344928
title  Cross-Social Network Investigation to Highlight Privacy Violations in Data Sharing Activities
pdfUrl  https://ceur-ws.org/Vol-3194/paper70.pdf
dblpUrl  https://dblp.org/rec/conf/sebd/CerrutoCDGP22
volume  Vol-3194→Vol-3194
session  →

Paper[edit]

Paper
edit
description  
id  Vol-3194/paper70
wikidataid  Q117344928→Q117344928
title  Cross-Social Network Investigation to Highlight Privacy Violations in Data Sharing Activities
pdfUrl  https://ceur-ws.org/Vol-3194/paper70.pdf
dblpUrl  https://dblp.org/rec/conf/sebd/CerrutoCDGP22
volume  Vol-3194→Vol-3194
session  →

Cross-Social Network Investigation to Highlight Privacy Violations in Data Sharing Activities[edit]

load PDF

Cross-Social Network Investigation to Highlight
Privacy Violations in Data Sharing Activities
Francesca Cerruto, Stefano Cirillo, Domenico Desiato, Simone Michele Gambardella
and Giuseppe Polese
University of Salerno, Department of Computer Science, Via Giovanni Paolo II, 132, 84084 Fisciano (SA), Italy


                                      Abstract
                                      Social networks represent a vast source of information and have an increasing impact on people’s daily
                                      lives. In fact, they permit to exhibit users’ lives, share emotions, passions, and interactions with other
                                      users around the world. These data need to be monitored because they could produce privacy violations,
                                      especially when they involve sensitive information. In this scenario, the definitions of privacy policies
                                      for safeguarding users’ data represent a difficult challenge that social networks have to deal with. In
                                      fact, although social network platforms offer privacy settings to protect data, often, users are unable
                                      to properly manage them to safeguard their privacy. To this end, in this work, we present a statistical
                                      investigation concerning privacy policies offered by social network platforms. In particular, we have
                                      defined a tool relying on image-recognition techniques capable of exploring social network platforms
                                      and identifying user profiles starting from their pictures. Moreover, we have composed a dataset of
                                      5000 users by retrieving their data available over different social network platforms in order to compare
                                      publicly accessible data provided in the registration phases, and those retrieved by our analysis. The
                                      proposed work underlines privacy violations over social network platforms when privacy policies are
                                      not managed correctly, and is targeted to improve the users’ awareness concerning the spreading and
                                      managing of their data. We have highlighted all the statistical evaluations made over the gathered data
                                      for putting in evidence the privacy issues.

                                      Keywords
                                      Privacy, Social Networks, Data Analysis


1. Introduction
Plenty of people are registered over several social networks, sharing a vast amount of infor-
mation. Moreover, social networks play a fundamental role in human interactions since they
permit people to share emotions, ways of thinking, points of view, and so on. In this scenario,
preserving users’ privacy is crucial for social network platforms since they cannot permit the
jeopardization of users’ data.
   Users tend to use social networks to share information massively, and in most cases, they do
not care about privatizing data and are unaware of the threats they can be exposed to. Moreover,
the growing number of users signing up on these platforms yields the necessity of analyzing
how they manage their privacy, mainly when using multiple social networks.
   Several studies have discussed data privacy issues on social networks [1, 2, 3, 4], but only

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
$ f.cerruto@studenti.unisa.it (F. Cerruto); scirillo@unisa.it (S. Cirillo); ddesiato@unisa.it (D. Desiato);
m.gambardella24@studenti.unisa.it (S. M. Gambardella); gpolese@unisa.it (G. Polese)
� 0000-0003-0201-2753 (S. Cirillo); 0000-0002-6327-459X (D. Desiato); 0000-0002-8496-2658 (G. Polese)
                                    © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
 CEUR
 Workshop
 Proceedings
               http://ceur-ws.org
               ISSN 1613-0073
                                    CEUR Workshop Proceedings (CEUR-WS.org)
�some of them have provided tools capable of improving users’ awareness when sharing their
data on social networking platforms [5, 6, 7]. In our work, we perform cross social network
analysis on five social network platforms to figure out which is the information that is most
frequently shared over social networks, and that can jeopardize user’s privacy [8, 9].
   In this discussion paper, we describe the tool SOcial Data Analyzer (SODA) proposed in
[10], which is able to find and extract available information of users on different platforms
considering only their photos. SODA allowed us to perform an accurate analysis for revealing
privacy threats linked to incorrect usage of data sharing in social networks. Moreover, the
tool allowed us to evaluate the sensitiveness of information shared by users and perform an
exhaustive analysis to understand how social networks can reconstruct users’ data even if some
of them are protected on other platforms.
   SODA is independent of the privacy settings offered by social networks since it simulates the
search of a real user and retrieves data publicly available in social network profiles. In other
words, if a user has privatized specific information over a specific social network, SODA cannot
retrieve that information. However, if the user has the same information not privatized over a
different social network, the SODA retrieves such information. Thus, we could say that (SODA)
tests the users’ skills in managing privacy settings offered by social network platforms.
   In summary, the main contributions of our study are 𝑖) a new tool capable of finding users
and extracting their data from different social network platforms, and 𝑖𝑖) a detailed analysis
of users’ data extracted from different social networks aiming to evaluate their privacy and
improve their awareness concerning privacy threats in social network platforms.
   The paper is organized as follows, in Section 2 we present our methodology, whereas Section
3 presents the architecture of SODA tool. In Sections 4 and 5 we describe the results of our
analysis. Finally, conclusions and future research directions are discussed in Section 6.

2. Methodology
In this section, we describe our methodology by summarising it in two meaningful steps: the
single- and the cross-social data extrapolation steps.
   In the single social data extrapolation step, the picture and the name of the users are exploited
as the input of the Social Module. Then, for each social network platform, the module performs
specific operations for scraping the content of the pages and extracting photos and names
associated with each profile. This information is given as input to a face recognition module
that tries to match the discovered photos and the initial user’s picture. If a match is found, the
social network module extracts all data available on a user profile.
   Similar to the single social step, the cross-social data extrapolation step starts by considering
the picture and the name of a user for searching a user from multiple platforms. However,
the main difference is the exploitation of multiple social network modules for extrapolating
several user profile data. In particular, each module extracts user profile data from each social
network platform according to its representation of the data. In this way, it is possible to collect
several user profile data from different social networks. Obviously, the only limitation is that the
target user needs to own a registered profile on each social network. Finally, all the user profile
data associated with each social network feed the integration module, which is responsible for
aggregating all the collected data.
�  The differentiation of the single- and cross-social analysis allows us to estimate the minimum
amount of user data that is possible to extract from each social network and evaluate the
maximum number of data that can be aggregated from different platforms. In the following
section, we first provide an overview of the new SODA system, and then we describe the
architecture underlying it.

3. Social Data Analyzer
Extracting user data from multiple social networks is a complex problem. There are several
issues related to extraction yielding specific choices for the components of the SODA tool: 𝑖)
the number of users involved in the analysis process can be large, 𝑖𝑖) each social network relies
on different implementation technologies, and 𝑖𝑖𝑖) continuous updates of the social network
platforms require continuous maintenance of system components. To this end, we have built the
tool SODA on top of the existing system Social Mapper1 , extending several of its components
aiming to tackle the issues mentioned above.
   In particular, Social Mapper is capable to search user profiles on multiple social network
platforms such as Facebook, Linkedin, Instagram, VKontakte, Twitter, Pinterest, Weibo, and
Douban. Because SODA is an extension of Social Mapper, it can search people by only con-
sidering an image and at least one information including name, surname, city, email, or the
company in which the user works. Starting from these, SODA exploits the Selenium framework
for creating a bot capable of automatically browsing web pages, by simulating the behaviors of
a real user during a web browsing session. In this way, SODA can exploit the search engines
behind each social network platform, by performing searching operations by means of the
search bars provided by each platform.
   With respect to Social Mapper, SODA provides several novel functionalities enabling to
perform an in-depth analysis of the data shared by users, and extend the search on a large scale.
The first new functionality allows SODA to find people working for a specific company, by
exploiting the search mechanism of Linkedin for selecting users that work in a given company.
As demonstrated in [11] the amount of fake users registered on Linkedin is very small, which
allowed us to create a dataset with a large number of real users. Most of the remaining extensions
concern the crawling components. In fact, Social Mapper is limited to only extracting the URLs
of the different user profiles. Thus, in SODA, we have re-designed the crawler modules aiming
to add several new navigation features capable of adapting to the different structures of the
web pages. The combination of these strategies with a powerful recognition algorithm, i.e.,
Viola-Jones [12], allows SODA to achieve accurate results on multiple platforms. It is important
to notice that, the face recognition algorithm return as output a user profile if and only if
the image is at least 60% compatible with the input one and if the data correspond with it.
This threshold ensures that the number of false positives is minimized. Figure 1 shows the
architecture of SODA. The data are acquired by the Parser component, which is responsible
for interpreting the system input, trying to understand the execution modes and for sharing
information of each user with the Face Recognition module. Moreover, the Parser invokes the
Browser Connector module interface, which enables SODA to execute the local web browser.
After which, it is necessary to interact with the web pages and extract information. To this
   1
       https://github.com/Greenwolf/social_mapper
�end, SODA exploits the functionalities provided by Selenium. More specifically, to extract
specific information on each social, we defined six modules, one for each social network on
which we can access user profiles and extract their information. In particular, SODA crawlers
search for a user by using the initial information read by the Parser module and extract all the
profile pictures of the users that match the search criteria. The list of pictures is sent to the Face
Recognition component, which compares the image taken in input with those extracted from
the social networks in order to identify the correct subjects to be analyzed. The list of identified
subjects is shared with the crawling modules, which acquire all information of each profile,
storing them locally. Finally, the Aggregator component receives all the data and groups all the
information extracted by the crawlers in a single file.
         SODA
                                                                            http://localhost:3000/index.html




                                                                                                                                user profile
                                                                                                                                     photos
           Social Mapper

                                                                                                                         user profile
                                                                                                                            photos             Selenium
                 User        user data             Parser                          Browser Connector                                           Crawler
                 Data      link to local         Component                              Module
                              photo


                           link to local photo                                                                                          Facebook          Twitter
                                                               user photo              Face Recognition




                                                                                                                                                                      crawler modules
                 User                                                                     Component                                     crawler           crawler
                Photos         user photo
                                                                                                                                        Linkedin          Pinterest
                               Users Profile                                                 Aggregator        user profile             crawler           crawler
                                   Data                                                      Component            data
                                                                                                                                        Instagram         Vkontakte
                                                                                                                                         crawler           crawler
          data layer                                  business layer



Figure 1: Architecture of SODA.

4. Evaluation single-social
The single social evaluation allowed us to highlight the information that is frequently shared
by users over every single social network and analyze how each social network preserves
user privacy. To perform this type of analysis, we have created a dataset of 5000 users, by
considering people working for a specific company that has been extracted by means of a new
feature of SODA enabling search people working for a specific company. Starting from the
5000 users contained in our dataset, we perform a single social network evaluation with the
aim to independently analyze the results obtained by each social network, avoiding considering
whether a user is present on multiple platforms, which will be discussed in Section 5. It is
important to note that the number of people evaluated for each social network corresponds to
those who were actually found by SODA after exploring the different platforms.
   Concerning Linkedin, 1570 user profiles have been evaluated. In particular, Employment
and the City are the most frequently shared information on Linkedin. More specifically, the
attribute city can refer to the place of residence or the place of birth, but in most cases, these
are equal. Results highlight even more that Linkedin is a social network for job finding where
users tend to share their employment and city, aiming to find better job opportunities.
   Concerning Facebook, 1161 user profiles have been evaluated. In particular, the basic infor-
mation related to the gender, Education or Work, and the Place where the user lives are the most
�frequently shared information on this social network. However, no user has shared his/her
details on the date of birth which, combined with the other data, could significantly affect
privacy. Facebook permits users to hide their date of birth in order to preserve privacy.
   Concerning Twitter, 86 user profiles have been evaluated. Despite not many users involved in
the analysis, the City, Website, and the Biography of a user are the most shared information on
this social network. In particular, through the biography, a user can share additional information,
such as his/her telephone number, email, or other information. Twitter is used by many famous
people, but it offers less prevention in terms of privacy, mainly due to the fact that users tend to
insert data in their biography, not being aware to disclose them.
   Concerning VKontakte, 251 user profiles have been evaluated. In particular, the Date of birth,
the Spoken languages, and the Education information are the most frequently shared data on
this social network. More specifically, not many users have shared their Telephone numbers.
As Facebook, also VKontakte is a social network that allows users to share a vast amount of
information, and it permits users to hide specific details to preserve privacy.
   Concerning Pinterest and Instagram, 1688 and 2845 user profiles have been evaluated. In
particular, these two social networks are massively used for sharing photos, and no other
types of data have been found for our analysis. Furthermore, the only textual information on
Instagram that seemed helpful in our analysis was the user biography. Yet, a user can write
anything in it, so we have decided not to take the biography into account for our analysis.

5. Evaluation cross-social
In this section, we describe the statistics derived by performing a cross-social analysis of the
publicly accessible information extracted from available social networks, and we investigate
the possibility of aggregating them aiming to perform a more detailed analysis.
   Figure 2 shows the distribution diagram for the users registered over the considered social
network platforms. In particular, except for the first bar that highlights the number of users
involved in no social networks, it is possible to group the other bars in three blocks, representing
the users found in one, two, and three social network platforms, respectively.

                           103
         Number of users




                           102

                           101

                           100
            Twitter
        VKontakte
         Facebook
          LinkedIn
          Pinterest
        Instagram


Figure 2: Distribution diagram of the analyzed users.
   A cross-social analysis permits the reconstruction of information over different social net-
works. For example, a user registered on several social networks can decide to privatize some
information on a specific social network, where s/he can choose to unmask the same information
�over other social networks. It means that it is possible to obtain more detailed information by
analyzing a specific user over different social networks.
   The most frequently accessed information on Twitter is the city since it can be reconstructed
through other social networks. In particular, 4923 users out of 5000 analyzed are not registered
on Twitter or have privatized this information on it. However, 31% of 4923 users published
their city on Linkedin, while 5% on Facebook, and 1% on VKontakte. The remaining 63% out of
4923 users did not share this information over any considered social network, leading to the
impossibility of extracting the information concerning their city. Consequently, only in the last
case, it is possible to guarantee the confidentiality of the data (e.g., city), by simply requiring
the management of its privatization over just one social network (e.g., Twitter).
   The information that is most frequently accessible on Facebook is Mobile phone, City, Date
of birth, Email, and information concerning Education and Training or Work. For our analysis
on Facebook, we have merged the last two attributes. We detail the percentage of information
privatized by Facebook users but published on other social networks:
    • Among the 5000 analyzed users who have privatized their Mobile number on Facebook,
      no one has allowed the reconstruction of this information from other social networks;
    • Among the 5000 users analyzed, 4743 have privatized their Hometown or Residence on
      Facebook or are not registered to this social network. Among them, 31% have published
      this information on Linkedin, 2% on Twitter, and 1% on VKontakte. Thus, 34% of them
      allow the reconstruction of this information from other social networks;
    • Among 5000 analyzed users who have privatized their Date of birth on Facebook or are
      not registered to this social network, 3% shared it on VKontakte, and 3% on Linkedin.
      In summary, 94% of analyzed users have privatized this information, since 6% of them
      shared it on other social networks;
    • Among the 5000 analyzed users who have privatized the Email on Facebook or are not
      registered to this social network, only 1% of them shared it on Linkedin, while 1% on
      VKontakte. In summary, 2% of analyzed users shared the Email on other social networks,
      so 98% have completely privatized it;
    • Among the 5000 users analyzed, 4721 users have privatized Education on Facebook or
      are not registered to this social network. Among them, 31% published this information
      on Linkedin, and 2% on VKontakte. In summary, 33% of analyzed users have shared the
      Education on other social networks, so 67% have completely privatized it.
Finally, most of the analyzed users who have privatized a given data on Facebook have also
privatized it on other social networks. Among all considered social networks, Linkedin has
proved to be helpful for the reconstruction of users’ information.
   The information that are most frequently accessible on Linkedin are Mobile phone, City, Date
of birth, Email, and Employment. We detail the percentage of information privatised on Linkedin,
but published on other social networks:
    • Similarly to Facebook, among the 5000 analyzed users who have privatized their mobile
      phone number on Facebook, or are not registered to this social network, no one published
      it on other social networks;
    • Among the 5000 users analyzed, 3450 have privatized their Hometown or Residence on
      Linkedin, or are not registered to this social network. Among them, 5% have published it
�       on Facebook, 2% on Twitter, and 1% on VKontakte. In summary, 8% of analysed users
       shared Hometown or Residence on other platforms, so 92% have completely privatised it;
    • Among the 5000 users analyzed, 4861 have privatized their Date of birth on Linkedin or
       are not registered to this social network. Among them, only 3% shared it on VKontakte.
       In summary, 3% of analyzed users shared the Date of birth on other social networks, while
       97% have completely privatized it;
    • Among the 5000 users analyzed, 4942 have privatized their Email on Linkedin or are
       not registered to this social network. Among them, only 1% shared it on VKontakte. In
       summary, 1% of analyzed users shared the Email on other social networks, while 99%
       have completely privatized it;
    • Among the 5000 users analyzed, 3445 have privatized their Training/Work on Linkedin
       or are not registered to this social network. Among them, 6% shared it on Facebook, and
       1% on VKontakte In summary, 7% of analyzed users shared the Training/Work on other
       social networks, so 93% have completely privatized it.
Finally, most of the analyzed users who have privatized a given data on Linkedin have also
privatized it on other social networks. Furthermore, among all considered social networks,
Facebook has proven to be helpful for the reconstruction of users’ information.
   The information that are most frequently shared on VKontakte are Mobile phone, City, Date
of birth, Email, and information concerning Training and Work. We detail the percentage of
information privatised on VKontakte, but published on other social networks:
    • Similarly to the previous analysis, among the 5000 analyzed users who have privatized
       their Mobile phone on VKontakte, or are not registered to this social network, no one
       published it on other social networks;
    • Among the 5000 users analyzed, 4990 have privatized their Hometown or Residence on
       VKontakte or are not registered to this social network. Among them, 30% of them have
       published it on Linkedin, 2% on Twitter, and 5% on Facebook. In summary, 37% of analysed
       users shared the Hometown or Residence on other social networks, so 63% have completely
       privatised it;
    • Among the 5000 users analyzed, 4832 have privatized their Date of birth on VKontakte or
       are not registered to this social network. Among them, only 3% of them have published it
       on Linkedin. In summary, 3% of analyzed users shared it on other social networks, so
       97% have completely privatized it;
    • Among the 5000 users analyzed, 4975 have privatized their Email on VKontakte or are not
       registered to this social network. Among them, only 1% of them shared it on Linkedin. In
       summary, 1% of analyzed users shared it on other social networks, so 99% have completely
       privatized it;
    • Among the 5000 users analyzed, 4997 have privatized their Education on VKontakte or
       are not registered to this social network. Among them, only 6% of them have published it
       on Facebook. In summary, 6% of analyzed users shared it on other social networks, so
       94% have completely privatized it;
    • Among the 5000 users analyzed, 4998 have privatized their Work on VKontakte or are
       not registered to this social network. Among them, 25.2% of them have published it on
       Linkedin, and 6.5% on Facebook. In summary, 31.7% of analyzed users shared it on other
       social networks, so 68.3% have completely privatized it.
�   Finally, most of the analyzed users who have privatized a given data on VKontakte have also
privatized it on other social networks, except for Employment, City of residence or Date of birth.
Among all considered social networks, Linkedin has proven to be helpful for the reconstruction
of users’ information.

6. Conclusion and Future directions
In our work, we have performed a single-social and a cross-social evaluation concerning users’
data to assess how easily they can be reconstructed from social networks. Our results highlight
that it is possible to obtain characterizing user’s information by analyzing his/her profile over
multiple platforms. Moreover, through the cross-social analysis, we also reconstructed other
significant users’ data by exploiting the combination of several social networks.
   In the future, we would like to collect more data concerning users by integrating information
over other social networks. Finally, we would also like to investigate the possibility of retrieving
information contained within users’ images by exploiting text recognition for gathering data.
References
 [1] P. R. M. Rao, S. M. Krishna, A. S. Kumar, Privacy preservation techniques in big data
     analytics: a survey, Journal of Big Data 5 (2018) 1–12.
 [2] P. Jain, M. Gyanchandani, N. Khare, Big data privacy: a technological perspective and
     review, Journal of Big Data 3 (2016) 1–25.
 [3] M. I. Pramanik, R. Y. Lau, M. S. Hossain, M. M. Rahoman, S. K. Debnath, M. G. Rashed,
     M. Z. Uddin, Privacy preserving big data analytics: A critical analysis of state-of-the-art,
     Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 11 (2021) e1387.
 [4] L. Caruccio, D. Desiato, G. Polese, Fake account identification in social networks, in: 2018
     IEEE international conference on big data (big data), IEEE, 2018, pp. 5078–5085.
 [5] S. Cirillo, D. Desiato, B. Breve, Chravat-chronology awareness visual analytic tool, in: 2019
     23rd International Conference Information Visualisation (IV), IEEE, 2019, pp. 255–260.
 [6] B. Breve, L. Caruccio, S. Cirillo, D. Desiato, V. Deufemia, G. Polese, Enhancing user
     awareness during internet browsing., in: ITASEC, 2020, pp. 71–81.
 [7] G. Bonifazi, E. Corradini, D. Ursino, L. Virgili, A social network analysis–based approach
     to investigate user behaviour during a cryptocurrency speculative bubble, Journal of
     Information Science (2021) 01655515211047428.
 [8] D. Desiato, G. Tortora, A methodology for gdpr compliant data processing., in: SEBD,
     volume 2161, 2018.
 [9] L. Caruccio, D. Desiato, G. Polese, G. Tortora, Gdpr compliant information confidentiality
     preservation in big data processing, IEEE Access 8 (2020) 205034–205050.
[10] F. Cerruto, S. Cirillo, D. Desiato, S. M. Gambardella, G. Polese, Social network data analysis
     to highlight privacy threats in sharing data, Journal of Big Data 9 (2022) 1–26.
[11] S. Adikari, K. Dutta, Identifying fake profiles in linkedin, arXiv preprint arXiv:2006.01381
     (2020).
[12] Y.-Q. Wang, An analysis of the viola-jones face detection algorithm, Image Processing On
     Line 4 (2014) 128–148.
�

Cross-Social Network Investigation to Highlight Privacy Violations in Data Sharing Activities[edit]

load PDF

Cross-Social Network Investigation to Highlight
Privacy Violations in Data Sharing Activities
Francesca Cerruto, Stefano Cirillo, Domenico Desiato, Simone Michele Gambardella
and Giuseppe Polese
University of Salerno, Department of Computer Science, Via Giovanni Paolo II, 132, 84084 Fisciano (SA), Italy


                                      Abstract
                                      Social networks represent a vast source of information and have an increasing impact on people’s daily
                                      lives. In fact, they permit to exhibit users’ lives, share emotions, passions, and interactions with other
                                      users around the world. These data need to be monitored because they could produce privacy violations,
                                      especially when they involve sensitive information. In this scenario, the definitions of privacy policies
                                      for safeguarding users’ data represent a difficult challenge that social networks have to deal with. In
                                      fact, although social network platforms offer privacy settings to protect data, often, users are unable
                                      to properly manage them to safeguard their privacy. To this end, in this work, we present a statistical
                                      investigation concerning privacy policies offered by social network platforms. In particular, we have
                                      defined a tool relying on image-recognition techniques capable of exploring social network platforms
                                      and identifying user profiles starting from their pictures. Moreover, we have composed a dataset of
                                      5000 users by retrieving their data available over different social network platforms in order to compare
                                      publicly accessible data provided in the registration phases, and those retrieved by our analysis. The
                                      proposed work underlines privacy violations over social network platforms when privacy policies are
                                      not managed correctly, and is targeted to improve the users’ awareness concerning the spreading and
                                      managing of their data. We have highlighted all the statistical evaluations made over the gathered data
                                      for putting in evidence the privacy issues.

                                      Keywords
                                      Privacy, Social Networks, Data Analysis


1. Introduction
Plenty of people are registered over several social networks, sharing a vast amount of infor-
mation. Moreover, social networks play a fundamental role in human interactions since they
permit people to share emotions, ways of thinking, points of view, and so on. In this scenario,
preserving users’ privacy is crucial for social network platforms since they cannot permit the
jeopardization of users’ data.
   Users tend to use social networks to share information massively, and in most cases, they do
not care about privatizing data and are unaware of the threats they can be exposed to. Moreover,
the growing number of users signing up on these platforms yields the necessity of analyzing
how they manage their privacy, mainly when using multiple social networks.
   Several studies have discussed data privacy issues on social networks [1, 2, 3, 4], but only

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
$ f.cerruto@studenti.unisa.it (F. Cerruto); scirillo@unisa.it (S. Cirillo); ddesiato@unisa.it (D. Desiato);
m.gambardella24@studenti.unisa.it (S. M. Gambardella); gpolese@unisa.it (G. Polese)
� 0000-0003-0201-2753 (S. Cirillo); 0000-0002-6327-459X (D. Desiato); 0000-0002-8496-2658 (G. Polese)
                                    © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
 CEUR
 Workshop
 Proceedings
               http://ceur-ws.org
               ISSN 1613-0073
                                    CEUR Workshop Proceedings (CEUR-WS.org)
�some of them have provided tools capable of improving users’ awareness when sharing their
data on social networking platforms [5, 6, 7]. In our work, we perform cross social network
analysis on five social network platforms to figure out which is the information that is most
frequently shared over social networks, and that can jeopardize user’s privacy [8, 9].
   In this discussion paper, we describe the tool SOcial Data Analyzer (SODA) proposed in
[10], which is able to find and extract available information of users on different platforms
considering only their photos. SODA allowed us to perform an accurate analysis for revealing
privacy threats linked to incorrect usage of data sharing in social networks. Moreover, the
tool allowed us to evaluate the sensitiveness of information shared by users and perform an
exhaustive analysis to understand how social networks can reconstruct users’ data even if some
of them are protected on other platforms.
   SODA is independent of the privacy settings offered by social networks since it simulates the
search of a real user and retrieves data publicly available in social network profiles. In other
words, if a user has privatized specific information over a specific social network, SODA cannot
retrieve that information. However, if the user has the same information not privatized over a
different social network, the SODA retrieves such information. Thus, we could say that (SODA)
tests the users’ skills in managing privacy settings offered by social network platforms.
   In summary, the main contributions of our study are 𝑖) a new tool capable of finding users
and extracting their data from different social network platforms, and 𝑖𝑖) a detailed analysis
of users’ data extracted from different social networks aiming to evaluate their privacy and
improve their awareness concerning privacy threats in social network platforms.
   The paper is organized as follows, in Section 2 we present our methodology, whereas Section
3 presents the architecture of SODA tool. In Sections 4 and 5 we describe the results of our
analysis. Finally, conclusions and future research directions are discussed in Section 6.

2. Methodology
In this section, we describe our methodology by summarising it in two meaningful steps: the
single- and the cross-social data extrapolation steps.
   In the single social data extrapolation step, the picture and the name of the users are exploited
as the input of the Social Module. Then, for each social network platform, the module performs
specific operations for scraping the content of the pages and extracting photos and names
associated with each profile. This information is given as input to a face recognition module
that tries to match the discovered photos and the initial user’s picture. If a match is found, the
social network module extracts all data available on a user profile.
   Similar to the single social step, the cross-social data extrapolation step starts by considering
the picture and the name of a user for searching a user from multiple platforms. However,
the main difference is the exploitation of multiple social network modules for extrapolating
several user profile data. In particular, each module extracts user profile data from each social
network platform according to its representation of the data. In this way, it is possible to collect
several user profile data from different social networks. Obviously, the only limitation is that the
target user needs to own a registered profile on each social network. Finally, all the user profile
data associated with each social network feed the integration module, which is responsible for
aggregating all the collected data.
�  The differentiation of the single- and cross-social analysis allows us to estimate the minimum
amount of user data that is possible to extract from each social network and evaluate the
maximum number of data that can be aggregated from different platforms. In the following
section, we first provide an overview of the new SODA system, and then we describe the
architecture underlying it.

3. Social Data Analyzer
Extracting user data from multiple social networks is a complex problem. There are several
issues related to extraction yielding specific choices for the components of the SODA tool: 𝑖)
the number of users involved in the analysis process can be large, 𝑖𝑖) each social network relies
on different implementation technologies, and 𝑖𝑖𝑖) continuous updates of the social network
platforms require continuous maintenance of system components. To this end, we have built the
tool SODA on top of the existing system Social Mapper1 , extending several of its components
aiming to tackle the issues mentioned above.
   In particular, Social Mapper is capable to search user profiles on multiple social network
platforms such as Facebook, Linkedin, Instagram, VKontakte, Twitter, Pinterest, Weibo, and
Douban. Because SODA is an extension of Social Mapper, it can search people by only con-
sidering an image and at least one information including name, surname, city, email, or the
company in which the user works. Starting from these, SODA exploits the Selenium framework
for creating a bot capable of automatically browsing web pages, by simulating the behaviors of
a real user during a web browsing session. In this way, SODA can exploit the search engines
behind each social network platform, by performing searching operations by means of the
search bars provided by each platform.
   With respect to Social Mapper, SODA provides several novel functionalities enabling to
perform an in-depth analysis of the data shared by users, and extend the search on a large scale.
The first new functionality allows SODA to find people working for a specific company, by
exploiting the search mechanism of Linkedin for selecting users that work in a given company.
As demonstrated in [11] the amount of fake users registered on Linkedin is very small, which
allowed us to create a dataset with a large number of real users. Most of the remaining extensions
concern the crawling components. In fact, Social Mapper is limited to only extracting the URLs
of the different user profiles. Thus, in SODA, we have re-designed the crawler modules aiming
to add several new navigation features capable of adapting to the different structures of the
web pages. The combination of these strategies with a powerful recognition algorithm, i.e.,
Viola-Jones [12], allows SODA to achieve accurate results on multiple platforms. It is important
to notice that, the face recognition algorithm return as output a user profile if and only if
the image is at least 60% compatible with the input one and if the data correspond with it.
This threshold ensures that the number of false positives is minimized. Figure 1 shows the
architecture of SODA. The data are acquired by the Parser component, which is responsible
for interpreting the system input, trying to understand the execution modes and for sharing
information of each user with the Face Recognition module. Moreover, the Parser invokes the
Browser Connector module interface, which enables SODA to execute the local web browser.
After which, it is necessary to interact with the web pages and extract information. To this
   1
       https://github.com/Greenwolf/social_mapper
�end, SODA exploits the functionalities provided by Selenium. More specifically, to extract
specific information on each social, we defined six modules, one for each social network on
which we can access user profiles and extract their information. In particular, SODA crawlers
search for a user by using the initial information read by the Parser module and extract all the
profile pictures of the users that match the search criteria. The list of pictures is sent to the Face
Recognition component, which compares the image taken in input with those extracted from
the social networks in order to identify the correct subjects to be analyzed. The list of identified
subjects is shared with the crawling modules, which acquire all information of each profile,
storing them locally. Finally, the Aggregator component receives all the data and groups all the
information extracted by the crawlers in a single file.
         SODA
                                                                            http://localhost:3000/index.html




                                                                                                                                user profile
                                                                                                                                     photos
           Social Mapper

                                                                                                                         user profile
                                                                                                                            photos             Selenium
                 User        user data             Parser                          Browser Connector                                           Crawler
                 Data      link to local         Component                              Module
                              photo


                           link to local photo                                                                                          Facebook          Twitter
                                                               user photo              Face Recognition




                                                                                                                                                                      crawler modules
                 User                                                                     Component                                     crawler           crawler
                Photos         user photo
                                                                                                                                        Linkedin          Pinterest
                               Users Profile                                                 Aggregator        user profile             crawler           crawler
                                   Data                                                      Component            data
                                                                                                                                        Instagram         Vkontakte
                                                                                                                                         crawler           crawler
          data layer                                  business layer



Figure 1: Architecture of SODA.

4. Evaluation single-social
The single social evaluation allowed us to highlight the information that is frequently shared
by users over every single social network and analyze how each social network preserves
user privacy. To perform this type of analysis, we have created a dataset of 5000 users, by
considering people working for a specific company that has been extracted by means of a new
feature of SODA enabling search people working for a specific company. Starting from the
5000 users contained in our dataset, we perform a single social network evaluation with the
aim to independently analyze the results obtained by each social network, avoiding considering
whether a user is present on multiple platforms, which will be discussed in Section 5. It is
important to note that the number of people evaluated for each social network corresponds to
those who were actually found by SODA after exploring the different platforms.
   Concerning Linkedin, 1570 user profiles have been evaluated. In particular, Employment
and the City are the most frequently shared information on Linkedin. More specifically, the
attribute city can refer to the place of residence or the place of birth, but in most cases, these
are equal. Results highlight even more that Linkedin is a social network for job finding where
users tend to share their employment and city, aiming to find better job opportunities.
   Concerning Facebook, 1161 user profiles have been evaluated. In particular, the basic infor-
mation related to the gender, Education or Work, and the Place where the user lives are the most
�frequently shared information on this social network. However, no user has shared his/her
details on the date of birth which, combined with the other data, could significantly affect
privacy. Facebook permits users to hide their date of birth in order to preserve privacy.
   Concerning Twitter, 86 user profiles have been evaluated. Despite not many users involved in
the analysis, the City, Website, and the Biography of a user are the most shared information on
this social network. In particular, through the biography, a user can share additional information,
such as his/her telephone number, email, or other information. Twitter is used by many famous
people, but it offers less prevention in terms of privacy, mainly due to the fact that users tend to
insert data in their biography, not being aware to disclose them.
   Concerning VKontakte, 251 user profiles have been evaluated. In particular, the Date of birth,
the Spoken languages, and the Education information are the most frequently shared data on
this social network. More specifically, not many users have shared their Telephone numbers.
As Facebook, also VKontakte is a social network that allows users to share a vast amount of
information, and it permits users to hide specific details to preserve privacy.
   Concerning Pinterest and Instagram, 1688 and 2845 user profiles have been evaluated. In
particular, these two social networks are massively used for sharing photos, and no other
types of data have been found for our analysis. Furthermore, the only textual information on
Instagram that seemed helpful in our analysis was the user biography. Yet, a user can write
anything in it, so we have decided not to take the biography into account for our analysis.

5. Evaluation cross-social
In this section, we describe the statistics derived by performing a cross-social analysis of the
publicly accessible information extracted from available social networks, and we investigate
the possibility of aggregating them aiming to perform a more detailed analysis.
   Figure 2 shows the distribution diagram for the users registered over the considered social
network platforms. In particular, except for the first bar that highlights the number of users
involved in no social networks, it is possible to group the other bars in three blocks, representing
the users found in one, two, and three social network platforms, respectively.

                           103
         Number of users




                           102

                           101

                           100
            Twitter
        VKontakte
         Facebook
          LinkedIn
          Pinterest
        Instagram


Figure 2: Distribution diagram of the analyzed users.
   A cross-social analysis permits the reconstruction of information over different social net-
works. For example, a user registered on several social networks can decide to privatize some
information on a specific social network, where s/he can choose to unmask the same information
�over other social networks. It means that it is possible to obtain more detailed information by
analyzing a specific user over different social networks.
   The most frequently accessed information on Twitter is the city since it can be reconstructed
through other social networks. In particular, 4923 users out of 5000 analyzed are not registered
on Twitter or have privatized this information on it. However, 31% of 4923 users published
their city on Linkedin, while 5% on Facebook, and 1% on VKontakte. The remaining 63% out of
4923 users did not share this information over any considered social network, leading to the
impossibility of extracting the information concerning their city. Consequently, only in the last
case, it is possible to guarantee the confidentiality of the data (e.g., city), by simply requiring
the management of its privatization over just one social network (e.g., Twitter).
   The information that is most frequently accessible on Facebook is Mobile phone, City, Date
of birth, Email, and information concerning Education and Training or Work. For our analysis
on Facebook, we have merged the last two attributes. We detail the percentage of information
privatized by Facebook users but published on other social networks:
    • Among the 5000 analyzed users who have privatized their Mobile number on Facebook,
      no one has allowed the reconstruction of this information from other social networks;
    • Among the 5000 users analyzed, 4743 have privatized their Hometown or Residence on
      Facebook or are not registered to this social network. Among them, 31% have published
      this information on Linkedin, 2% on Twitter, and 1% on VKontakte. Thus, 34% of them
      allow the reconstruction of this information from other social networks;
    • Among 5000 analyzed users who have privatized their Date of birth on Facebook or are
      not registered to this social network, 3% shared it on VKontakte, and 3% on Linkedin.
      In summary, 94% of analyzed users have privatized this information, since 6% of them
      shared it on other social networks;
    • Among the 5000 analyzed users who have privatized the Email on Facebook or are not
      registered to this social network, only 1% of them shared it on Linkedin, while 1% on
      VKontakte. In summary, 2% of analyzed users shared the Email on other social networks,
      so 98% have completely privatized it;
    • Among the 5000 users analyzed, 4721 users have privatized Education on Facebook or
      are not registered to this social network. Among them, 31% published this information
      on Linkedin, and 2% on VKontakte. In summary, 33% of analyzed users have shared the
      Education on other social networks, so 67% have completely privatized it.
Finally, most of the analyzed users who have privatized a given data on Facebook have also
privatized it on other social networks. Among all considered social networks, Linkedin has
proved to be helpful for the reconstruction of users’ information.
   The information that are most frequently accessible on Linkedin are Mobile phone, City, Date
of birth, Email, and Employment. We detail the percentage of information privatised on Linkedin,
but published on other social networks:
    • Similarly to Facebook, among the 5000 analyzed users who have privatized their mobile
      phone number on Facebook, or are not registered to this social network, no one published
      it on other social networks;
    • Among the 5000 users analyzed, 3450 have privatized their Hometown or Residence on
      Linkedin, or are not registered to this social network. Among them, 5% have published it
�       on Facebook, 2% on Twitter, and 1% on VKontakte. In summary, 8% of analysed users
       shared Hometown or Residence on other platforms, so 92% have completely privatised it;
    • Among the 5000 users analyzed, 4861 have privatized their Date of birth on Linkedin or
       are not registered to this social network. Among them, only 3% shared it on VKontakte.
       In summary, 3% of analyzed users shared the Date of birth on other social networks, while
       97% have completely privatized it;
    • Among the 5000 users analyzed, 4942 have privatized their Email on Linkedin or are
       not registered to this social network. Among them, only 1% shared it on VKontakte. In
       summary, 1% of analyzed users shared the Email on other social networks, while 99%
       have completely privatized it;
    • Among the 5000 users analyzed, 3445 have privatized their Training/Work on Linkedin
       or are not registered to this social network. Among them, 6% shared it on Facebook, and
       1% on VKontakte In summary, 7% of analyzed users shared the Training/Work on other
       social networks, so 93% have completely privatized it.
Finally, most of the analyzed users who have privatized a given data on Linkedin have also
privatized it on other social networks. Furthermore, among all considered social networks,
Facebook has proven to be helpful for the reconstruction of users’ information.
   The information that are most frequently shared on VKontakte are Mobile phone, City, Date
of birth, Email, and information concerning Training and Work. We detail the percentage of
information privatised on VKontakte, but published on other social networks:
    • Similarly to the previous analysis, among the 5000 analyzed users who have privatized
       their Mobile phone on VKontakte, or are not registered to this social network, no one
       published it on other social networks;
    • Among the 5000 users analyzed, 4990 have privatized their Hometown or Residence on
       VKontakte or are not registered to this social network. Among them, 30% of them have
       published it on Linkedin, 2% on Twitter, and 5% on Facebook. In summary, 37% of analysed
       users shared the Hometown or Residence on other social networks, so 63% have completely
       privatised it;
    • Among the 5000 users analyzed, 4832 have privatized their Date of birth on VKontakte or
       are not registered to this social network. Among them, only 3% of them have published it
       on Linkedin. In summary, 3% of analyzed users shared it on other social networks, so
       97% have completely privatized it;
    • Among the 5000 users analyzed, 4975 have privatized their Email on VKontakte or are not
       registered to this social network. Among them, only 1% of them shared it on Linkedin. In
       summary, 1% of analyzed users shared it on other social networks, so 99% have completely
       privatized it;
    • Among the 5000 users analyzed, 4997 have privatized their Education on VKontakte or
       are not registered to this social network. Among them, only 6% of them have published it
       on Facebook. In summary, 6% of analyzed users shared it on other social networks, so
       94% have completely privatized it;
    • Among the 5000 users analyzed, 4998 have privatized their Work on VKontakte or are
       not registered to this social network. Among them, 25.2% of them have published it on
       Linkedin, and 6.5% on Facebook. In summary, 31.7% of analyzed users shared it on other
       social networks, so 68.3% have completely privatized it.
�   Finally, most of the analyzed users who have privatized a given data on VKontakte have also
privatized it on other social networks, except for Employment, City of residence or Date of birth.
Among all considered social networks, Linkedin has proven to be helpful for the reconstruction
of users’ information.

6. Conclusion and Future directions
In our work, we have performed a single-social and a cross-social evaluation concerning users’
data to assess how easily they can be reconstructed from social networks. Our results highlight
that it is possible to obtain characterizing user’s information by analyzing his/her profile over
multiple platforms. Moreover, through the cross-social analysis, we also reconstructed other
significant users’ data by exploiting the combination of several social networks.
   In the future, we would like to collect more data concerning users by integrating information
over other social networks. Finally, we would also like to investigate the possibility of retrieving
information contained within users’ images by exploiting text recognition for gathering data.
References
 [1] P. R. M. Rao, S. M. Krishna, A. S. Kumar, Privacy preservation techniques in big data
     analytics: a survey, Journal of Big Data 5 (2018) 1–12.
 [2] P. Jain, M. Gyanchandani, N. Khare, Big data privacy: a technological perspective and
     review, Journal of Big Data 3 (2016) 1–25.
 [3] M. I. Pramanik, R. Y. Lau, M. S. Hossain, M. M. Rahoman, S. K. Debnath, M. G. Rashed,
     M. Z. Uddin, Privacy preserving big data analytics: A critical analysis of state-of-the-art,
     Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 11 (2021) e1387.
 [4] L. Caruccio, D. Desiato, G. Polese, Fake account identification in social networks, in: 2018
     IEEE international conference on big data (big data), IEEE, 2018, pp. 5078–5085.
 [5] S. Cirillo, D. Desiato, B. Breve, Chravat-chronology awareness visual analytic tool, in: 2019
     23rd International Conference Information Visualisation (IV), IEEE, 2019, pp. 255–260.
 [6] B. Breve, L. Caruccio, S. Cirillo, D. Desiato, V. Deufemia, G. Polese, Enhancing user
     awareness during internet browsing., in: ITASEC, 2020, pp. 71–81.
 [7] G. Bonifazi, E. Corradini, D. Ursino, L. Virgili, A social network analysis–based approach
     to investigate user behaviour during a cryptocurrency speculative bubble, Journal of
     Information Science (2021) 01655515211047428.
 [8] D. Desiato, G. Tortora, A methodology for gdpr compliant data processing., in: SEBD,
     volume 2161, 2018.
 [9] L. Caruccio, D. Desiato, G. Polese, G. Tortora, Gdpr compliant information confidentiality
     preservation in big data processing, IEEE Access 8 (2020) 205034–205050.
[10] F. Cerruto, S. Cirillo, D. Desiato, S. M. Gambardella, G. Polese, Social network data analysis
     to highlight privacy threats in sharing data, Journal of Big Data 9 (2022) 1–26.
[11] S. Adikari, K. Dutta, Identifying fake profiles in linkedin, arXiv preprint arXiv:2006.01381
     (2020).
[12] Y.-Q. Wang, An analysis of the viola-jones face detection algorithm, Image Processing On
     Line 4 (2014) 128–148.
�
🖨 🚪