Vol-3194/paper11

Wolfgang Fahl

Paper

Paper
edit
description
id	Vol-3194/paper11
wikidataid	Q117344953→Q117344953
title	Understanding RDF Data Representations in Triplestores
pdfUrl	https://ceur-ws.org/Vol-3194/paper11.pdf
dblpUrl	https://dblp.org/rec/conf/sebd/LissandriniSPH22
volume	Vol-3194→Vol-3194
session	→

Paper[edit]

Paper
edit
description
id	Vol-3194/paper11
wikidataid	Q117344953→Q117344953
title	Understanding RDF Data Representations in Triplestores
pdfUrl	https://ceur-ws.org/Vol-3194/paper11.pdf
dblpUrl	https://dblp.org/rec/conf/sebd/LissandriniSPH22
volume	Vol-3194→Vol-3194
session	→

Understanding RDF Data Representations in Triplestores[edit]

The last editor of this page did not have the right to Embed PDFs into pages.

Understanding RDF Data Representations in
Triplestores*
Matteo Lissandrini1 , Tomer Sagi1 , Torben Bach Pedersen1 and Katja Hose1
1
    Aalborg University, Denmark


                                         Abstract
                                         Because of the flexibility and expressiveness of their model, Knowledge Graphs (KGs) have received
                                         increasing interest. These resources are usually represented in RDF and stored in specialized data
                                         management systems called triplestores. Yet, while there exists a multitude of such systems, exploiting
                                         varying data representation and indexing schemes, it is unclear which of the many design choices are
                                         the most effective for a given database and query workload. Thus, first, we introduce a set of 20 access
                                         patterns, which we identify within 6 categories, adopted to analyze the needs of a given query workload.
                                         Then, we identify a novel three-dimensional design space for RDF data representations built on the
                                         dimensions of subdivision, redundancy, and compression of data. This design space maps the trade-offs
                                         between different RDF data representations employed to store RDF data within a triplestore. Thus, each
                                         of the required access patterns is compared against its compatibility with a given data representation. As
                                         we show, this approach allows identifying both the most effective RDF data representation for a given
                                         query workload as well as unexplored design solutions.

                                         Keywords
                                         RDF Data Management, Knowledge Graphs, Data Management Systems Design




1. Introduction
Knowledge Graphs (KGs) are nowadays already in widespread use because of their ability to
represent entities and their relationships in many domains [2, 3]. KGs are usually represented as
RDF data1 . RDF models a graph as a set of subject-predicate-object triples, such that nodes serve
as subjects and objects and edge labels as predicates. RDF data is usually stored in specialized
data management systems called triplestores supporting declarative queries written in the
SPARQL language. In parallel to the growing interest in KGs, we have seen the emergence of
many different triplestore implementations, either from academic prototypes (e.g., RDF-3X),
community projects (e.g., JENA TDB), or commercial products (e.g., Virtuoso, and GraphDB).
Nonetheless, each system employs its own peculiar set of design choices, each appearing equally
compelling at a first glance. Here, we provide a better understanding of the pros and cons of the

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
*The full paper was originally published in the VLDB journal [1].
" matteo@cs.aau.dk (M. Lissandrini); tsagi@cs.aau.dk (T. Sagi); tbp@cs.aau.dk (T. B. Pedersen); khose@cs.aau.dk
(K. Hose)
� 0000-0001-7922-5998 (M. Lissandrini); 0000-0002-8916-0128 (T. Sagi); 0000-0002-1615-777X (T. B. Pedersen);
0000-0001-7025-8099 (K. Hose)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR

         CEUR Workshop Proceedings (CEUR-WS.org)
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073




                  1
     The property graph [4] model has also been proposed, but this model has still not converged in a common
standard and is not usually adopted to store KGs.
�various design choices w.r.t. the needs of a given use case and query workload. We specifically
focus on the existing solutions for storing, indexing, and querying RDF data. Our analysis
complements existing surveys [5, 6], which merely provide a taxonomy of the features offered by
the systems (e.g., API and data loading facilities) or classify them according to their underlying
technology (e.g., relational versus native graph), but fail short in providing a detailed analysis of
the design space in which their internal architectures reside. At the same time, this study, also
complements existing efforts in improving the performances of these system by optimizing their
query processing techniques [7]. Hence, we first characterize the feature space for query access
patterns, i.e., the logical operations in a query that, given some input values, establish what kind
of output values must be returned. Then, we uniquely identify the defining characteristics of
existing RDF data representations within a unifying three-dimensional design space across three
dimensions: Subdivision, Compression, and Redundancy (SCR). The design space defines the
dimensions along which any storage system for RDF data must be designed. Thus, our approach
identifies the most suitable data representation for a given access pattern, as well as access
patterns that are currently not optimally supported by the existing representations. We thereby
make the following contributions: (i) a feature space of 20 access patterns within 6 categories
for SPARQL queries (Table 1); (ii) a design space for RDF data representations based on data
subdivsion, redundancy, and compression (Table 2), which allows analyzing the trade-offs of
each design choice; and finally (iii) a thorough methodology to analyze how different design
choices impact system performance when answering access patterns with specific features. In
the following, we first introduce the core concepts regarding the RDF data model (Section 2) and
the features of the SPARQL query language (Section 3). Then, we present our SCR design space
(Section 4) and conclude explaining how it allows deriving a number of important findings
about the data representations adopted by existing systems (Section 5).


2. Preliminaries: RDF & SPARQL
The RDF data model revolves around the concepts of RDF triples and RDF graph patterns. An
RDF triple, which corresponds to an edge in the graph, is a factual statement consisting of a
subject (𝑠), a predicate (𝑝), and an object (𝑜). Subjects and objects, which represent nodes, can
be resources identified by International Resource Identifiers (IRI), anonymous nodes identified
by internal IDs (called blank nodes), or literals (which can appear only as objects). Thus, an
RDF graph 𝒢 is a set of RDF triples. An RDF graph is queried by issuing a SPARQL query whose
answers are computed based on the subgraphs matching the structural and logical constraints
it prescribes. SPARQL queries contain one or more basic graph patterns (BGPs), which are sets
of triples with zero or more of their components replaced with variables from an infinite set
of variables 𝒳 . Thus, triple patterns are (𝑠, 𝑝, 𝑜) triples, where any position may be replaced
by a variable 𝑥∈𝒳 . We refer to each value in each position as an atom. Moreover, there are
special types of patterns that match pairs of atoms connected through sequences of edges
satisfying a specific set of predicates, e.g., all nodes connected by an arbitrary sequence of edges
with ex:childOf predicates. These patterns are called property paths. They are defined via a
specialized form of expressions called path expressions (similar to regular expressions) and offer
a succinct way to extend matching of triple patterns to paths of arbitrary length [8]. Finally,
�                                                                           Return Values
          SELECT ?event ?eventLabel
                                                                           Constants PO
          WHERE {
                                                                            Constants P
               ?event wdt:P31/wdt:P279* wd:Q1190554. ★ ✪
                                                                           Range Closed
               ?event wdt:P585|wdt:P580 ?date. ★
                                                                           Range Special
               ?event rdfs:label ?eventLabel.
                                                                          Pivot N-way S≡S
               FILTER(DATATYPE(?date) = xsd:dateTime).
                                                                           ★ 1-hop SàO
               FILTER(?date <= 2019-12-31 && ?date > 2019-12-01).
                                                                           ★ 1-hop OàS
               FILTER(LANG(?eventLabel) = "en").                           ✪ SàP*àO
          }
Figure 1: SPARQL query with annotated access patterns.


solutions of a query are the variables bindings, i.e., the mappings from each variable to a given
node or edge in the graph that satisfies the conditions in the BGP.
   In its simplest form, a SPARQL query has the form “SELECT 𝑉          ⃗ WHERE 𝑃 ”, with 𝑃 =
{𝑡1 , ... , 𝑡𝑛 } being a set of triple patterns (as in Figure 1, the annotations are explained in
the next section). Optionally, one or more FILTER clauses further constrain the variables in
𝑃 . Let 𝒳𝑃 denote the finite set of variables occurring in 𝑃 , i.e., 𝒳𝑃 ⊂𝒳 , then 𝑉⃗ is the vector
of variables returned by the query such that 𝑉     ⃗ ⊆ 𝒳𝑃 . Additional operators such as UNION
or OPTIONAL allow more than one BGP in a query by defining the non-conjunctive semantics
of their combination. Finally, SPARQL queries can also make use of GROUP BY and aggregate
operators. SPARQL queries are declarative and are therefore designed to be decoupled from the
physical data access methods, i.e., the low-level algorithms and data structures for organizing
and accessing data. This decoupling allows specific triplestore implementations to use different
data representations and query processing techniques to execute a given query.


3. SPARQL Access Patterns
To identify the best internal representation for an RDF dataset and a given SPARQL query, we
decompose the query’s triple patterns into atomic operators and identify how they access a
given data representation. In particular, we identify a set of access patterns that enable the
required translation from the declarative query language to its logical query plan. Thus, an
access pattern is the set of logical operations that, given a set of variable bindings over a graph,
determine what data to access (and optionally to change) and what output to produce. The
feature space of access patterns (summarized in Table 1) is comprised of 6 dimensions, each
containing a set of alternative features, namely: Constants, the presence of constant values as
atoms (instead of variables); Filter, the presence (or absence) of a filter on a range of values;
Traversal, the connection of two nodes by means of one or more subsequent triple patterns;
Pivot, the atom connecting multiple triple patterns in a BGP; Return, the information expected
to be returned; Write, whether and how a change in the content of the graph is needed.
Consider the example in Figure 1, representing a query to retrieve location coordinates of
archaeological sites, and compare it to the access patterns in Table 1. We can recognize, for
instance, the property path wdt:P31/wdt:P279*, of the form p1/p2* o, meaning that it
�Table 1
Feature space of access patterns for a SPARQL query.

 Dimension   Features

 Constants   𝑆, 𝑃, 𝑂                            𝑆𝑃, 𝑆𝑂, 𝑃 𝑂                          𝑆𝑃 𝑂
             1 Constant                         2 Constants                          3 Constants
 Filter      <&>                                <|>                                  T
             Closed range                       Open range                           Special type range
                                                   k
 Traversal   𝑠→𝑜                                𝑠→− 𝑜                                𝑝*/+
             1-hop over 𝑝                       k-hop over 𝑝1 , . . . , 𝑝𝑘           ?-hop over 𝑝
 Pivot       𝑠≡𝑠 / 𝑜≡𝑜 / 𝑝≡𝑝                    𝑜≡𝑠 / 𝑜≡𝑝 / 𝑠≡𝑝                      𝑣0 ≡𝑣1 ∧ 𝑣0 ≡𝑣2 ∧ ...𝑣0 ≡𝑣𝑁
             binary; same S/P/O position        binary; different S/P/O positions    N-way; any S/P/O positions
 Return      Values (all)   Values (distinct)   Values (sorted)          Existence   Aggregate

 Write       Update                             Insert                               Delete


would match two alternative access patterns: (1) 1-hop traversals over p1 = wdt:P31 reaching
the target node o=wd:Q1190554 directly, and (2) *-hop traversals starting with one edge for
wdt:P31 and reaching the object through a sequence of arbitrarily long paths matching the p2
= wdt:P279 triple pattern. On the other hand, the ?event variable is a 3-way pivot, which
can also be executed as a set of binary 𝑠≡𝑠 pivot access patterns. Similarly, we can also identify
range filters and the presence of constant values. Each of these access patterns can be effectively
implemented and answered by specific RDF data representations and indexes.


4. RDF Data Representations in the SCR Space
To support the data access patterns, triplestores implement a set of different data representations
storing all or a subset of the graph 𝒢. A data representation 𝒟 to answer the given access
pattern 𝒜 needs to provide a way to retrieve the values for the bindings of the variables 𝑉 ⃗ from
the information stored in 𝒢 (or to modify 𝒢 for write operations). Therefore, given an access
pattern 𝒜 and a data representation 𝒟 of 𝒢, we evaluate the performance of 𝒟 in providing the
correct instantiations of 𝑉
                          ⃗ for 𝒜. Moreover, we also evaluate how effective, in terms of time, it
is to execute 𝒜 over 𝒟, i.e., its cost. Answering such a question allows, given two distinct data
representations, to select the most appropriate one, i.e., the one with the lowest cost.
   Cost model. Answering a query using a particular data representation incurs a cost expressed
in terms of the time it takes to retrieve all such answers. Several such cost models have
been proposed for RDF [9, 10], but were tied to a simple two-tiered memory hierarchy and
assumed a relational data representation. While such assumptions are no longer applicable,
it still holds across novel (existing or future) machine architectures and memory types that
random seek operations incur a different cost than sequential read operations. Additionally, we
identify another set of costs derived from the difference between accessing compressed and
uncompressed data. That is, while the use of compression is employed to minimize memory
requirements and speed up the movement of large result sets, it incurs additional costs in
compression and decompression. Therefore, we follow the recent convention of employing a
�            ?s    ?p        ?o      b)     ?s                                    c)     ?s     <p1> <p2> <p3> . . .
     a)                                             <p1>,<o1>    <p2>,<o2>            <uri1> <o1>      <o2>   -   ...
          <uri1> <p1> <o1>               <uri1>
          <uri1> <p2> <o2>                                                            <uri2> <o3>       -   <o4> ...
                                         <uri2>     <p1>,<o3>    <p3>,<o4>            <uri3> <o1>       -     -   ...
          <uri2> <p1> <o3>               <uri3>
          <uri2> <p3> <o4>                                                            <uri4>    -      <o5>   -   ...
                                         <uri4>     <p1>,<o1>
          <uri3> <p1> <o1>                                                              ...    ...      ...  ... ...
                                          . . .
          <uri4> <p2> <o5>                                                            <uri22> <o14>     -   <o15> ...
                                         <uri22>    <p2>,<o5>
           . . . . . . . . .
          <uri22> <p1> <o14>                                     <p3>,<o15>
          <uri22> <p3> <o15>                        <p1>,<o14>


                       d)        <p1><uri22><o14>    <p3><uri1><o4>



                            <p1><uri3><o1>               <p2><uri1><o2>       <p2><uri4><o5>



                       <p1><uri1><o1>      <p1><uri2><o3>        <p1><uri5><o6>       <p1><uri6><o7>

Figure 2: Common representations: a) sorted file, b) hash map, c) table, and d) B+ tree

set of cost constants [11] differentiating between the cost of random and sequential access and
between the cost of compressed and uncompressed access.
   Data Representation Compatibility. To assess if a data representation can efficiently
support a specific access pattern, we define the notion of compatibility. We define this notion
specifically for read operations, under the assumption that a read operation is required before
deletion (to locate the data to be deleted) and before insertion (to locate where to insert the data
and to avoid duplicates) and is therefore required for every type of access pattern. Therefore,
given an access pattern 𝒜 and a data representation 𝒟 able to answer 𝒜, we assume that there
is a sequence of operations specified by an algorithm Γ defined over 𝒟 that can compute an
answer 𝒮, such that 𝒮 = Γ(𝒜, 𝒟). The number and cost of the operations specified by Γ directly
determine whether 𝒟 is a representation suitable for efficiently computing the answers to 𝒜.
When measuring the number of operations required to be executed over 𝒟, we distinguish
between random seek and sequential operations. In particular, an RDF data representation can
be seek compatible, sequence compatible, or selection compatible.
   For instance, one way to store 𝒢 is as a list of 3-tuples (one per edge) sorted by subject and
then by predicate and object (Figure 2.a and 2.d) with a clustered B+ tree index over them. In this
representation, the query processing cost would be analogous to that of a relational table with
three attributes (𝑠, 𝑝, 𝑜), all part of a primary index. This representation is sequence compatible
with any 1-hop access pattern that binds 𝑠, both 𝑠 and 𝑝, or all three positions. That is, the
algorithm Γ(⟨s1,p1,?o⟩, {B+tree,sorted(𝒢)}), would first find the first tuple performing 𝑙𝑜𝑔(𝒢)
steps traversing the B+ tree looking for s1, p1, and then perform a linear scan over the file to
retrieve the remaining tuples. However, the B+ tree is not seek compatible because the time to
find the first result depends on the size of the graph, i.e., it involves 𝑙𝑜𝑔(|𝒢|) seek operations to
reach the first result. A different data representation is to employ a key-value data structure
(similar to Figure 2b) and use the pair subject-predicate as the key, and the object as the value.
In this data structure, triples sharing the same 𝑠 and 𝑝 will store the list of objects contiguously.
This data structure is both seek and sequence compatible for a traversal that, given 𝑠 and 𝑝,
retrieves all corresponding objects. Nevertheless, this representation is neither seek compatible
�Table 2
Summary of data representation design space axes.
 Axis          Minimal         Ex-   Maximal Extreme            Positive Effect     Negative Effect
               treme
 Subdivision   Unsorted file         Pointers between every     ↓ # unneeded data   ↑ # random seeks
                                     related data item          items
 Compression   No compression        Compress all subdivi-      ↓ # read bytes      ↑ Decompression cost
                                     sions in all representa-
                                     tions
 Redundancy    Single represen-      One representation for     ↓ # random seeks    ↑ maintenance cost, storage
               tation                each possible BGP                              cost, query optimization time


nor sequence compatible if the query requires all edges for predicate 𝑝 regardless of 𝑠. Thus,
the definitions of cost and compatibility presented above allow analyzing the advantages and
drawbacks of the different choices in each dimension of the design space.
   The Design Space Dimensions. When designing data representations for an RDF store, one
can model the design decisions over three axes: Subdivision, Compression, and Redundancy. Each
of these orthogonal axes, whose properties are summarized in Table 2, represents a continuum
along which the data representation adopted by a system can be positioned.
   The Subdivision axis determines how fragmented the data is. At one extreme, all data is
stored contiguously in a single structure; at the other extreme, each edge and node is stored as
a separate object with pointers to its neighbors. In between, there are various data structures
such as B+ trees or hash maps. This axis contains design decisions such as sorting, grouping,
and hashing. Each of these decisions creates an additional subdivision in the data. Increasing
the extent of subdivision, allows us to minimize the number of unneeded data items accessed
to answer an access pattern (resulting in fewer operations that return items not in the result
set). For instance, in a sorted file (Figure 2a), the sorting keys act as the simplest subdivision
by collecting all triples with the same value together. On the other hand, with a hash table
(Figure 2b), given the target IRI, the hash function separates all relevant triples sharing the same
key. Therefore, decisions across the subdivision axis easily determine whether a data structure
is selection compatible, i.e., if it binds all and only the answers within a specific subdivision, but
it also impacts random and sequence compatibility.
   The goal of Compression is to minimize the number of bits read to reach the first tuple in the
result set and the number of bits required to read and potentially store the result set for further
processing. The potentially negative impact of compression is, of course, the decompression
required to evaluate a predicate in the access pattern if the access pattern is incompatible or
partially incompatible. For example, consider a compressed data representation tuned for queries
of the form (𝑠, ?𝑝, ?𝑜) and 𝑠≡𝑠 access patterns (e.g., BitMat [12]). Using this representation
to answer an 𝑠 𝑝*/+ 𝑜 pattern (i.e., is 𝑜 reachable from 𝑠 through edges labeled with 𝑝) would
require a potentially large number of row decompression operations to identify the edges that
satisfy the traversal condition. Moreover, decisions across the compression axis heavily impact
the selection compatibility when data that does not contribute to the answer are compressed
together with data relevant for an access pattern.
   The third axis is Redundancy which causes (redundant) copies of the data to be stored in the
system. By adding redundant data representations and indexes, it is possible to define ideal (seek,
�sequence, and selection compatible) data representations for each access pattern. However,
this comes at the cost of having to store the same information multiple times. For example, by
holding both an SPO clustered index and a PSO clustered index, each triple is stored twice. Thus,
this hinders the compatibility with write access patterns. Moreover, design decisions including
full/partial replication need to find a trade-off between storage space and efficient support of
query access patterns. Hence, the cost of maintaining multiple representations is three-fold: (i)
Increased latency for delete and insert operations with possibly reduced performance of read
operations as well. (ii) An increase in space requirements, subsequently straining the limited
space in main memory and causing an increase in all read costs. (iii) An increase in query
optimization time because alternative structures to access the data result in a higher number of
query execution plans that have to be considered.


5. The Capabilities of the SCR Space Analysis
Thanks to the feature space for analyzing query workload access patterns, and the Subdivision-
Compression-Redundancy (SCR) design space of data representations for RDF databases, we
enable the analysis of RDF store design decisions, specifically, which data representations can
effectively support a given workload. Previous workload analyses are centered around those
characteristics that would impact mainly the query optimizer. Instead, we present an analysis
of the access patterns specific to the storage layer. The effect of various choices along the three
axes is summarized in Table 2. We performed an analysis of popular RDF store benchmarks
under these assumptions and identified a number of workload requirements for which systems
rarely provide effective optimizations. Considering the Subdivision dimension, many systems
implement dedicated tables for triples sharing a specific predicate. Alternatively, queries that
match triples with the same subject can usually exploit the clustered representation where
several properties for the same subject (or object) are stored together in a single row [13].
Furthermore, a constant in 𝑠 or 𝑜 can be exploited by a hash index of the form 𝑠 ↦→𝑝𝑜. Yet, no
representation exploits subdivision for a pair of constants, e.g., 𝑠𝑝 ↦→𝑜. Overall, we see that
there is a strong incompatibility between filters on literals and the fact that all representations
store literals and special types contiguously. We also perform a similar analysis also across
the Redundancy dimension. For instance, specialized indexes have been proposed to speed
up reachability and multi-hop path expressions (e.g., Sparqling kleene [14]). Yet, in existing
systems, data representations supporting the traversal operations required by k-hop and *-hop
access patterns are rarely employed. Our survey also reveals the absence of index Compression
solutions (e.g., [15]). For instance, we do not find any application of effective compression
solutions for secondary representations, such as counting indexes (e.g., sp→#) and structures
like bloom-filters [16] or compact LSM-tries like SuRF [17].
   Thus, researchers can use the SCR design space to design novel solutions in an informed
manner. Such designs can be achieved either manually or even semi-automatically as proposed
by the recent effort in self-designing data structures [11] and self-organizing database systems
[18]. By cross-referencing the access patterns in a query workload with SCR design options,
future RDF stores will be able to add components of the system to match the expected workload.
�Acknowledgments
This research is supported by the European Union H2020 research and innovation program under
the Marie Skłodowska-Curie grant agreement No 838216 and the Poul Due Jensen Foundation.


References
 [1] T. Sagi, M. Lissandrini, T. B. Pedersen, K. Hose, A design space for rdf data representations,
     The VLDB Journal (2022) 1–27.
 [2] N. Noy, Y. Gao, A. Jain, A. Narayanan, A. Patterson, J. Taylor, Industry-scale knowledge
     graphs: Lessons and challenges, ACM Queue 17 (2019) 48–75.
 [3] M. Lissandrini, D. Mottin, K. Hose, T. B. Pedersen, Knowledge graph exploration systems:
     are we lost?, in: CIDR, 2022.
 [4] M. Lissandrini, M. Brugnara, Y. Velegrakis, Beyond macrobenchmarks: Microbenchmark-
     based graph database evaluation, Proc. VLDB Endow. 12 (2018) 390–403.
 [5] K. Nitta, I. Savnik, Survey of RDF Storage Managers, in: Proceedings of the International
     Conference on Advances in Databases, Knowledge, and Data Applications, 2014.
 [6] I. Abdelaziz, R. Harbi, Z. Khayyat, P. Kalnis, A survey and experimental comparison of
     distributed sparql engines for very large rdf data, Proc. VLDB Endow. 10 (2017) 2049–2060.
 [7] K. Rabbani, M. Lissandrini, K. Hose, Optimizing SPARQL queries using shape statistics, in:
     EDBT 2021, OpenProceedings.org, 2021, pp. 505–510.
 [8] S. Harris, A. Seaborne, SPARQL 1.1 Query Language. W3C Recommendation, 2012.
 [9] R. Cyganiak, A relational algebra for SPARQL, Technical Report, HP Laboratories Bristol,
     Bristol, UK, 2005.
[10] F. Frasincar, G.-J. Houben, R. Vdovjak, P. Barna, RAL: An Algebra for Querying RDF,
     WWW 7 (2004) 83–109.
[11] S. Idreos, K. Zoumpatianos, B. Hentschel, M. S. Kester, D. Guo, The data calculator: Data
     structure design and cost synthesis from first principles and learned cost models, in:
     SIGMOD, 2018, pp. 535–550.
[12] M. Atre, J. Srinivasan, J. A. Hendler, Bitmat: A main-memory bit matrix of RDF triples for
     conjunctive triple pattern queries, in: ISWC (Posters & Demonstrations), 2008, pp. 1–2.
[13] M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, B. Bhat-
     tacharjee, Building an efficient RDF store over a relational database, in: SIGMOD, 2013.
[14] A. Gubichev, S. J. Bedathur, S. Seufert, Sparqling kleene: fast property paths in RDF-3X,
     in: Workshop on Graph Data Management Experiences and Systems, GRADES, 2013.
[15] H. Zhang, D. G. Andersen, A. Pavlo, M. Kaminsky, L. Ma, R. Shen, Reducing the storage
     overhead of main-memory OLTP databases with hybrid indexes, in: SIGMOD, 2016.
[16] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, Multilayer compressed counting bloom
     filters, in: Proceedings of the 27th Conference on Computer Communications, IEEE, 2008.
[17] H. Zhang, H. Lim, V. Leis, D. G. Andersen, M. Kaminsky, K. Keeton, A. Pavlo, Surf: Practical
     range query filtering with fast succinct tries, in: SIGMOD, 2018, pp. 323–336.
[18] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. C. Mowry, M. Perron,
     I. Quah, S. Santurkar, A. Tomasic, S. Toor, D. V. Aken, Z. Wang, Y. Wu, R. Xian, T. Zhang,
     Self-driving database management systems, in: CIDR, 2017.
�

Understanding RDF Data Representations in Triplestores[edit]

The last editor of this page did not have the right to Embed PDFs into pages.

Understanding RDF Data Representations in
Triplestores*
Matteo Lissandrini1 , Tomer Sagi1 , Torben Bach Pedersen1 and Katja Hose1
1
    Aalborg University, Denmark


                                         Abstract
                                         Because of the flexibility and expressiveness of their model, Knowledge Graphs (KGs) have received
                                         increasing interest. These resources are usually represented in RDF and stored in specialized data
                                         management systems called triplestores. Yet, while there exists a multitude of such systems, exploiting
                                         varying data representation and indexing schemes, it is unclear which of the many design choices are
                                         the most effective for a given database and query workload. Thus, first, we introduce a set of 20 access
                                         patterns, which we identify within 6 categories, adopted to analyze the needs of a given query workload.
                                         Then, we identify a novel three-dimensional design space for RDF data representations built on the
                                         dimensions of subdivision, redundancy, and compression of data. This design space maps the trade-offs
                                         between different RDF data representations employed to store RDF data within a triplestore. Thus, each
                                         of the required access patterns is compared against its compatibility with a given data representation. As
                                         we show, this approach allows identifying both the most effective RDF data representation for a given
                                         query workload as well as unexplored design solutions.

                                         Keywords
                                         RDF Data Management, Knowledge Graphs, Data Management Systems Design




1. Introduction
Knowledge Graphs (KGs) are nowadays already in widespread use because of their ability to
represent entities and their relationships in many domains [2, 3]. KGs are usually represented as
RDF data1 . RDF models a graph as a set of subject-predicate-object triples, such that nodes serve
as subjects and objects and edge labels as predicates. RDF data is usually stored in specialized
data management systems called triplestores supporting declarative queries written in the
SPARQL language. In parallel to the growing interest in KGs, we have seen the emergence of
many different triplestore implementations, either from academic prototypes (e.g., RDF-3X),
community projects (e.g., JENA TDB), or commercial products (e.g., Virtuoso, and GraphDB).
Nonetheless, each system employs its own peculiar set of design choices, each appearing equally
compelling at a first glance. Here, we provide a better understanding of the pros and cons of the

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
*The full paper was originally published in the VLDB journal [1].
" matteo@cs.aau.dk (M. Lissandrini); tsagi@cs.aau.dk (T. Sagi); tbp@cs.aau.dk (T. B. Pedersen); khose@cs.aau.dk
(K. Hose)
� 0000-0001-7922-5998 (M. Lissandrini); 0000-0002-8916-0128 (T. Sagi); 0000-0002-1615-777X (T. B. Pedersen);
0000-0001-7025-8099 (K. Hose)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR

         CEUR Workshop Proceedings (CEUR-WS.org)
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073




                  1
     The property graph [4] model has also been proposed, but this model has still not converged in a common
standard and is not usually adopted to store KGs.
�various design choices w.r.t. the needs of a given use case and query workload. We specifically
focus on the existing solutions for storing, indexing, and querying RDF data. Our analysis
complements existing surveys [5, 6], which merely provide a taxonomy of the features offered by
the systems (e.g., API and data loading facilities) or classify them according to their underlying
technology (e.g., relational versus native graph), but fail short in providing a detailed analysis of
the design space in which their internal architectures reside. At the same time, this study, also
complements existing efforts in improving the performances of these system by optimizing their
query processing techniques [7]. Hence, we first characterize the feature space for query access
patterns, i.e., the logical operations in a query that, given some input values, establish what kind
of output values must be returned. Then, we uniquely identify the defining characteristics of
existing RDF data representations within a unifying three-dimensional design space across three
dimensions: Subdivision, Compression, and Redundancy (SCR). The design space defines the
dimensions along which any storage system for RDF data must be designed. Thus, our approach
identifies the most suitable data representation for a given access pattern, as well as access
patterns that are currently not optimally supported by the existing representations. We thereby
make the following contributions: (i) a feature space of 20 access patterns within 6 categories
for SPARQL queries (Table 1); (ii) a design space for RDF data representations based on data
subdivsion, redundancy, and compression (Table 2), which allows analyzing the trade-offs of
each design choice; and finally (iii) a thorough methodology to analyze how different design
choices impact system performance when answering access patterns with specific features. In
the following, we first introduce the core concepts regarding the RDF data model (Section 2) and
the features of the SPARQL query language (Section 3). Then, we present our SCR design space
(Section 4) and conclude explaining how it allows deriving a number of important findings
about the data representations adopted by existing systems (Section 5).


2. Preliminaries: RDF & SPARQL
The RDF data model revolves around the concepts of RDF triples and RDF graph patterns. An
RDF triple, which corresponds to an edge in the graph, is a factual statement consisting of a
subject (𝑠), a predicate (𝑝), and an object (𝑜). Subjects and objects, which represent nodes, can
be resources identified by International Resource Identifiers (IRI), anonymous nodes identified
by internal IDs (called blank nodes), or literals (which can appear only as objects). Thus, an
RDF graph 𝒢 is a set of RDF triples. An RDF graph is queried by issuing a SPARQL query whose
answers are computed based on the subgraphs matching the structural and logical constraints
it prescribes. SPARQL queries contain one or more basic graph patterns (BGPs), which are sets
of triples with zero or more of their components replaced with variables from an infinite set
of variables 𝒳 . Thus, triple patterns are (𝑠, 𝑝, 𝑜) triples, where any position may be replaced
by a variable 𝑥∈𝒳 . We refer to each value in each position as an atom. Moreover, there are
special types of patterns that match pairs of atoms connected through sequences of edges
satisfying a specific set of predicates, e.g., all nodes connected by an arbitrary sequence of edges
with ex:childOf predicates. These patterns are called property paths. They are defined via a
specialized form of expressions called path expressions (similar to regular expressions) and offer
a succinct way to extend matching of triple patterns to paths of arbitrary length [8]. Finally,
�                                                                           Return Values
          SELECT ?event ?eventLabel
                                                                           Constants PO
          WHERE {
                                                                            Constants P
               ?event wdt:P31/wdt:P279* wd:Q1190554. ★ ✪
                                                                           Range Closed
               ?event wdt:P585|wdt:P580 ?date. ★
                                                                           Range Special
               ?event rdfs:label ?eventLabel.
                                                                          Pivot N-way S≡S
               FILTER(DATATYPE(?date) = xsd:dateTime).
                                                                           ★ 1-hop SàO
               FILTER(?date <= 2019-12-31 && ?date > 2019-12-01).
                                                                           ★ 1-hop OàS
               FILTER(LANG(?eventLabel) = "en").                           ✪ SàP*àO
          }
Figure 1: SPARQL query with annotated access patterns.


solutions of a query are the variables bindings, i.e., the mappings from each variable to a given
node or edge in the graph that satisfies the conditions in the BGP.
   In its simplest form, a SPARQL query has the form “SELECT 𝑉          ⃗ WHERE 𝑃 ”, with 𝑃 =
{𝑡1 , ... , 𝑡𝑛 } being a set of triple patterns (as in Figure 1, the annotations are explained in
the next section). Optionally, one or more FILTER clauses further constrain the variables in
𝑃 . Let 𝒳𝑃 denote the finite set of variables occurring in 𝑃 , i.e., 𝒳𝑃 ⊂𝒳 , then 𝑉⃗ is the vector
of variables returned by the query such that 𝑉     ⃗ ⊆ 𝒳𝑃 . Additional operators such as UNION
or OPTIONAL allow more than one BGP in a query by defining the non-conjunctive semantics
of their combination. Finally, SPARQL queries can also make use of GROUP BY and aggregate
operators. SPARQL queries are declarative and are therefore designed to be decoupled from the
physical data access methods, i.e., the low-level algorithms and data structures for organizing
and accessing data. This decoupling allows specific triplestore implementations to use different
data representations and query processing techniques to execute a given query.


3. SPARQL Access Patterns
To identify the best internal representation for an RDF dataset and a given SPARQL query, we
decompose the query’s triple patterns into atomic operators and identify how they access a
given data representation. In particular, we identify a set of access patterns that enable the
required translation from the declarative query language to its logical query plan. Thus, an
access pattern is the set of logical operations that, given a set of variable bindings over a graph,
determine what data to access (and optionally to change) and what output to produce. The
feature space of access patterns (summarized in Table 1) is comprised of 6 dimensions, each
containing a set of alternative features, namely: Constants, the presence of constant values as
atoms (instead of variables); Filter, the presence (or absence) of a filter on a range of values;
Traversal, the connection of two nodes by means of one or more subsequent triple patterns;
Pivot, the atom connecting multiple triple patterns in a BGP; Return, the information expected
to be returned; Write, whether and how a change in the content of the graph is needed.
Consider the example in Figure 1, representing a query to retrieve location coordinates of
archaeological sites, and compare it to the access patterns in Table 1. We can recognize, for
instance, the property path wdt:P31/wdt:P279*, of the form p1/p2* o, meaning that it
�Table 1
Feature space of access patterns for a SPARQL query.

 Dimension   Features

 Constants   𝑆, 𝑃, 𝑂                            𝑆𝑃, 𝑆𝑂, 𝑃 𝑂                          𝑆𝑃 𝑂
             1 Constant                         2 Constants                          3 Constants
 Filter      <&>                                <|>                                  T
             Closed range                       Open range                           Special type range
                                                   k
 Traversal   𝑠→𝑜                                𝑠→− 𝑜                                𝑝*/+
             1-hop over 𝑝                       k-hop over 𝑝1 , . . . , 𝑝𝑘           ?-hop over 𝑝
 Pivot       𝑠≡𝑠 / 𝑜≡𝑜 / 𝑝≡𝑝                    𝑜≡𝑠 / 𝑜≡𝑝 / 𝑠≡𝑝                      𝑣0 ≡𝑣1 ∧ 𝑣0 ≡𝑣2 ∧ ...𝑣0 ≡𝑣𝑁
             binary; same S/P/O position        binary; different S/P/O positions    N-way; any S/P/O positions
 Return      Values (all)   Values (distinct)   Values (sorted)          Existence   Aggregate

 Write       Update                             Insert                               Delete


would match two alternative access patterns: (1) 1-hop traversals over p1 = wdt:P31 reaching
the target node o=wd:Q1190554 directly, and (2) *-hop traversals starting with one edge for
wdt:P31 and reaching the object through a sequence of arbitrarily long paths matching the p2
= wdt:P279 triple pattern. On the other hand, the ?event variable is a 3-way pivot, which
can also be executed as a set of binary 𝑠≡𝑠 pivot access patterns. Similarly, we can also identify
range filters and the presence of constant values. Each of these access patterns can be effectively
implemented and answered by specific RDF data representations and indexes.


4. RDF Data Representations in the SCR Space
To support the data access patterns, triplestores implement a set of different data representations
storing all or a subset of the graph 𝒢. A data representation 𝒟 to answer the given access
pattern 𝒜 needs to provide a way to retrieve the values for the bindings of the variables 𝑉 ⃗ from
the information stored in 𝒢 (or to modify 𝒢 for write operations). Therefore, given an access
pattern 𝒜 and a data representation 𝒟 of 𝒢, we evaluate the performance of 𝒟 in providing the
correct instantiations of 𝑉
                          ⃗ for 𝒜. Moreover, we also evaluate how effective, in terms of time, it
is to execute 𝒜 over 𝒟, i.e., its cost. Answering such a question allows, given two distinct data
representations, to select the most appropriate one, i.e., the one with the lowest cost.
   Cost model. Answering a query using a particular data representation incurs a cost expressed
in terms of the time it takes to retrieve all such answers. Several such cost models have
been proposed for RDF [9, 10], but were tied to a simple two-tiered memory hierarchy and
assumed a relational data representation. While such assumptions are no longer applicable,
it still holds across novel (existing or future) machine architectures and memory types that
random seek operations incur a different cost than sequential read operations. Additionally, we
identify another set of costs derived from the difference between accessing compressed and
uncompressed data. That is, while the use of compression is employed to minimize memory
requirements and speed up the movement of large result sets, it incurs additional costs in
compression and decompression. Therefore, we follow the recent convention of employing a
�            ?s    ?p        ?o      b)     ?s                                    c)     ?s     <p1> <p2> <p3> . . .
     a)                                             <p1>,<o1>    <p2>,<o2>            <uri1> <o1>      <o2>   -   ...
          <uri1> <p1> <o1>               <uri1>
          <uri1> <p2> <o2>                                                            <uri2> <o3>       -   <o4> ...
                                         <uri2>     <p1>,<o3>    <p3>,<o4>            <uri3> <o1>       -     -   ...
          <uri2> <p1> <o3>               <uri3>
          <uri2> <p3> <o4>                                                            <uri4>    -      <o5>   -   ...
                                         <uri4>     <p1>,<o1>
          <uri3> <p1> <o1>                                                              ...    ...      ...  ... ...
                                          . . .
          <uri4> <p2> <o5>                                                            <uri22> <o14>     -   <o15> ...
                                         <uri22>    <p2>,<o5>
           . . . . . . . . .
          <uri22> <p1> <o14>                                     <p3>,<o15>
          <uri22> <p3> <o15>                        <p1>,<o14>


                       d)        <p1><uri22><o14>    <p3><uri1><o4>



                            <p1><uri3><o1>               <p2><uri1><o2>       <p2><uri4><o5>



                       <p1><uri1><o1>      <p1><uri2><o3>        <p1><uri5><o6>       <p1><uri6><o7>

Figure 2: Common representations: a) sorted file, b) hash map, c) table, and d) B+ tree

set of cost constants [11] differentiating between the cost of random and sequential access and
between the cost of compressed and uncompressed access.
   Data Representation Compatibility. To assess if a data representation can efficiently
support a specific access pattern, we define the notion of compatibility. We define this notion
specifically for read operations, under the assumption that a read operation is required before
deletion (to locate the data to be deleted) and before insertion (to locate where to insert the data
and to avoid duplicates) and is therefore required for every type of access pattern. Therefore,
given an access pattern 𝒜 and a data representation 𝒟 able to answer 𝒜, we assume that there
is a sequence of operations specified by an algorithm Γ defined over 𝒟 that can compute an
answer 𝒮, such that 𝒮 = Γ(𝒜, 𝒟). The number and cost of the operations specified by Γ directly
determine whether 𝒟 is a representation suitable for efficiently computing the answers to 𝒜.
When measuring the number of operations required to be executed over 𝒟, we distinguish
between random seek and sequential operations. In particular, an RDF data representation can
be seek compatible, sequence compatible, or selection compatible.
   For instance, one way to store 𝒢 is as a list of 3-tuples (one per edge) sorted by subject and
then by predicate and object (Figure 2.a and 2.d) with a clustered B+ tree index over them. In this
representation, the query processing cost would be analogous to that of a relational table with
three attributes (𝑠, 𝑝, 𝑜), all part of a primary index. This representation is sequence compatible
with any 1-hop access pattern that binds 𝑠, both 𝑠 and 𝑝, or all three positions. That is, the
algorithm Γ(⟨s1,p1,?o⟩, {B+tree,sorted(𝒢)}), would first find the first tuple performing 𝑙𝑜𝑔(𝒢)
steps traversing the B+ tree looking for s1, p1, and then perform a linear scan over the file to
retrieve the remaining tuples. However, the B+ tree is not seek compatible because the time to
find the first result depends on the size of the graph, i.e., it involves 𝑙𝑜𝑔(|𝒢|) seek operations to
reach the first result. A different data representation is to employ a key-value data structure
(similar to Figure 2b) and use the pair subject-predicate as the key, and the object as the value.
In this data structure, triples sharing the same 𝑠 and 𝑝 will store the list of objects contiguously.
This data structure is both seek and sequence compatible for a traversal that, given 𝑠 and 𝑝,
retrieves all corresponding objects. Nevertheless, this representation is neither seek compatible
�Table 2
Summary of data representation design space axes.
 Axis          Minimal         Ex-   Maximal Extreme            Positive Effect     Negative Effect
               treme
 Subdivision   Unsorted file         Pointers between every     ↓ # unneeded data   ↑ # random seeks
                                     related data item          items
 Compression   No compression        Compress all subdivi-      ↓ # read bytes      ↑ Decompression cost
                                     sions in all representa-
                                     tions
 Redundancy    Single represen-      One representation for     ↓ # random seeks    ↑ maintenance cost, storage
               tation                each possible BGP                              cost, query optimization time


nor sequence compatible if the query requires all edges for predicate 𝑝 regardless of 𝑠. Thus,
the definitions of cost and compatibility presented above allow analyzing the advantages and
drawbacks of the different choices in each dimension of the design space.
   The Design Space Dimensions. When designing data representations for an RDF store, one
can model the design decisions over three axes: Subdivision, Compression, and Redundancy. Each
of these orthogonal axes, whose properties are summarized in Table 2, represents a continuum
along which the data representation adopted by a system can be positioned.
   The Subdivision axis determines how fragmented the data is. At one extreme, all data is
stored contiguously in a single structure; at the other extreme, each edge and node is stored as
a separate object with pointers to its neighbors. In between, there are various data structures
such as B+ trees or hash maps. This axis contains design decisions such as sorting, grouping,
and hashing. Each of these decisions creates an additional subdivision in the data. Increasing
the extent of subdivision, allows us to minimize the number of unneeded data items accessed
to answer an access pattern (resulting in fewer operations that return items not in the result
set). For instance, in a sorted file (Figure 2a), the sorting keys act as the simplest subdivision
by collecting all triples with the same value together. On the other hand, with a hash table
(Figure 2b), given the target IRI, the hash function separates all relevant triples sharing the same
key. Therefore, decisions across the subdivision axis easily determine whether a data structure
is selection compatible, i.e., if it binds all and only the answers within a specific subdivision, but
it also impacts random and sequence compatibility.
   The goal of Compression is to minimize the number of bits read to reach the first tuple in the
result set and the number of bits required to read and potentially store the result set for further
processing. The potentially negative impact of compression is, of course, the decompression
required to evaluate a predicate in the access pattern if the access pattern is incompatible or
partially incompatible. For example, consider a compressed data representation tuned for queries
of the form (𝑠, ?𝑝, ?𝑜) and 𝑠≡𝑠 access patterns (e.g., BitMat [12]). Using this representation
to answer an 𝑠 𝑝*/+ 𝑜 pattern (i.e., is 𝑜 reachable from 𝑠 through edges labeled with 𝑝) would
require a potentially large number of row decompression operations to identify the edges that
satisfy the traversal condition. Moreover, decisions across the compression axis heavily impact
the selection compatibility when data that does not contribute to the answer are compressed
together with data relevant for an access pattern.
   The third axis is Redundancy which causes (redundant) copies of the data to be stored in the
system. By adding redundant data representations and indexes, it is possible to define ideal (seek,
�sequence, and selection compatible) data representations for each access pattern. However,
this comes at the cost of having to store the same information multiple times. For example, by
holding both an SPO clustered index and a PSO clustered index, each triple is stored twice. Thus,
this hinders the compatibility with write access patterns. Moreover, design decisions including
full/partial replication need to find a trade-off between storage space and efficient support of
query access patterns. Hence, the cost of maintaining multiple representations is three-fold: (i)
Increased latency for delete and insert operations with possibly reduced performance of read
operations as well. (ii) An increase in space requirements, subsequently straining the limited
space in main memory and causing an increase in all read costs. (iii) An increase in query
optimization time because alternative structures to access the data result in a higher number of
query execution plans that have to be considered.


5. The Capabilities of the SCR Space Analysis
Thanks to the feature space for analyzing query workload access patterns, and the Subdivision-
Compression-Redundancy (SCR) design space of data representations for RDF databases, we
enable the analysis of RDF store design decisions, specifically, which data representations can
effectively support a given workload. Previous workload analyses are centered around those
characteristics that would impact mainly the query optimizer. Instead, we present an analysis
of the access patterns specific to the storage layer. The effect of various choices along the three
axes is summarized in Table 2. We performed an analysis of popular RDF store benchmarks
under these assumptions and identified a number of workload requirements for which systems
rarely provide effective optimizations. Considering the Subdivision dimension, many systems
implement dedicated tables for triples sharing a specific predicate. Alternatively, queries that
match triples with the same subject can usually exploit the clustered representation where
several properties for the same subject (or object) are stored together in a single row [13].
Furthermore, a constant in 𝑠 or 𝑜 can be exploited by a hash index of the form 𝑠 ↦→𝑝𝑜. Yet, no
representation exploits subdivision for a pair of constants, e.g., 𝑠𝑝 ↦→𝑜. Overall, we see that
there is a strong incompatibility between filters on literals and the fact that all representations
store literals and special types contiguously. We also perform a similar analysis also across
the Redundancy dimension. For instance, specialized indexes have been proposed to speed
up reachability and multi-hop path expressions (e.g., Sparqling kleene [14]). Yet, in existing
systems, data representations supporting the traversal operations required by k-hop and *-hop
access patterns are rarely employed. Our survey also reveals the absence of index Compression
solutions (e.g., [15]). For instance, we do not find any application of effective compression
solutions for secondary representations, such as counting indexes (e.g., sp→#) and structures
like bloom-filters [16] or compact LSM-tries like SuRF [17].
   Thus, researchers can use the SCR design space to design novel solutions in an informed
manner. Such designs can be achieved either manually or even semi-automatically as proposed
by the recent effort in self-designing data structures [11] and self-organizing database systems
[18]. By cross-referencing the access patterns in a query workload with SCR design options,
future RDF stores will be able to add components of the system to match the expected workload.
�Acknowledgments
This research is supported by the European Union H2020 research and innovation program under
the Marie Skłodowska-Curie grant agreement No 838216 and the Poul Due Jensen Foundation.


References
 [1] T. Sagi, M. Lissandrini, T. B. Pedersen, K. Hose, A design space for rdf data representations,
     The VLDB Journal (2022) 1–27.
 [2] N. Noy, Y. Gao, A. Jain, A. Narayanan, A. Patterson, J. Taylor, Industry-scale knowledge
     graphs: Lessons and challenges, ACM Queue 17 (2019) 48–75.
 [3] M. Lissandrini, D. Mottin, K. Hose, T. B. Pedersen, Knowledge graph exploration systems:
     are we lost?, in: CIDR, 2022.
 [4] M. Lissandrini, M. Brugnara, Y. Velegrakis, Beyond macrobenchmarks: Microbenchmark-
     based graph database evaluation, Proc. VLDB Endow. 12 (2018) 390–403.
 [5] K. Nitta, I. Savnik, Survey of RDF Storage Managers, in: Proceedings of the International
     Conference on Advances in Databases, Knowledge, and Data Applications, 2014.
 [6] I. Abdelaziz, R. Harbi, Z. Khayyat, P. Kalnis, A survey and experimental comparison of
     distributed sparql engines for very large rdf data, Proc. VLDB Endow. 10 (2017) 2049–2060.
 [7] K. Rabbani, M. Lissandrini, K. Hose, Optimizing SPARQL queries using shape statistics, in:
     EDBT 2021, OpenProceedings.org, 2021, pp. 505–510.
 [8] S. Harris, A. Seaborne, SPARQL 1.1 Query Language. W3C Recommendation, 2012.
 [9] R. Cyganiak, A relational algebra for SPARQL, Technical Report, HP Laboratories Bristol,
     Bristol, UK, 2005.
[10] F. Frasincar, G.-J. Houben, R. Vdovjak, P. Barna, RAL: An Algebra for Querying RDF,
     WWW 7 (2004) 83–109.
[11] S. Idreos, K. Zoumpatianos, B. Hentschel, M. S. Kester, D. Guo, The data calculator: Data
     structure design and cost synthesis from first principles and learned cost models, in:
     SIGMOD, 2018, pp. 535–550.
[12] M. Atre, J. Srinivasan, J. A. Hendler, Bitmat: A main-memory bit matrix of RDF triples for
     conjunctive triple pattern queries, in: ISWC (Posters & Demonstrations), 2008, pp. 1–2.
[13] M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, B. Bhat-
     tacharjee, Building an efficient RDF store over a relational database, in: SIGMOD, 2013.
[14] A. Gubichev, S. J. Bedathur, S. Seufert, Sparqling kleene: fast property paths in RDF-3X,
     in: Workshop on Graph Data Management Experiences and Systems, GRADES, 2013.
[15] H. Zhang, D. G. Andersen, A. Pavlo, M. Kaminsky, L. Ma, R. Shen, Reducing the storage
     overhead of main-memory OLTP databases with hybrid indexes, in: SIGMOD, 2016.
[16] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, Multilayer compressed counting bloom
     filters, in: Proceedings of the 27th Conference on Computer Communications, IEEE, 2008.
[17] H. Zhang, H. Lim, V. Leis, D. G. Andersen, M. Kaminsky, K. Keeton, A. Pavlo, Surf: Practical
     range query filtering with fast succinct tries, in: SIGMOD, 2018, pp. 323–336.
[18] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. C. Mowry, M. Perron,
     I. Quah, S. Santurkar, A. Tomasic, S. Toor, D. V. Aken, Z. Wang, Y. Wu, R. Xian, T. Zhang,
     Self-driving database management systems, in: CIDR, 2017.
�