Vol-3194/award2

Wolfgang Fahl

Paper

Paper
edit
description
id	Vol-3194/award2
wikidataid	Q117344971→Q117344971
title	Querying Genome Databases
pdfUrl	https://ceur-ws.org/Vol-3194/award2.pdf
dblpUrl	https://dblp.org/rec/conf/sebd/Mecca22
volume	Vol-3194→Vol-3194
session	→

Paper[edit]

Paper
edit
description
id	Vol-3194/award2
wikidataid	Q117344971→Q117344971
title	Querying Genome Databases
pdfUrl	https://ceur-ws.org/Vol-3194/award2.pdf
dblpUrl	https://dblp.org/rec/conf/sebd/Mecca22
volume	Vol-3194→Vol-3194
session	→

Querying Genome Databases[edit]

Querying Genome Databases
Extended Abstract

Giansalvatore Mecca1
1
Dipartimento di Matematica, Informatica ed Economia, Universitá della Basilicata

Abstract
Genome Databases contain genetic information about human beings and other species of living organisms.
Since 1986, year in which the Human Genome Project started, they have grown exponentially in size every
year. Due to the unusual structure of the stored information, traditional DBMS and query languages
have provided only little support so far. In this paper we discuss some of the issues related to genomic
information management.

1. Introduction
The Human Genome Project is a research effort started back in 1986 as an initiative of two U.S.
institutions, the Department of Energy (DOE) and the National Institute of Health (NIH). The
focus of the project is the study of genetic information, which represents the base for the life of
every living organism. The purpose is to gain a better understanding of the nature of genetic
and cellular mechanisms, that will provide an unvaluable source of information for medicine
and health care in the next years.
Computer science and especially database technology has been by the side of genetics and
molecular biology from the start. In fact, the huge amount of experimental data produced and
the great complexity of experimental protocols require sophisticated computing applications
to manage and interpret data. The main feature of these applications is represented on one
side by the sequential structure of data to be stored, and on the other side by the complex
nature of the transformations required to study these data. It is rather apparent the fact that
commercial DBMS based on relational technology are not completely suited to this purpose. In
fact, commercial products have been rather unsuccessful in this field so far.
This paper tries to sketch some of the issues related to sequential data management in the
context of Genome databases, pointing out where the main weaknesses of current technology
are. The paper mainly focuses on aspects related to query languages.
Due to the interdisciplinary nature of the matter, the paper is organized as follows. In
Section 2 we introduce a few basic notions of molecular biology, necessary to gain a better
understanding of the domain of interest and of examples throughout the paper. For a detailed
introduction to the subject, we refer the reader to [14, 7]. Then, the Human Genome Project is
briefly presented in Section 3. Technological issues related to Genome databases are discussed
in Section 4, whereas Section 5 specifically focuses on aspects related to querying genomic data.
SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
" giansalvatore.mecca@unibas.it (G. Mecca)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073
CEUR Workshop Proceedings (CEUR-WS.org)
�2. Notions of Molecular Biology
The term genome is used to refer to the whole genetic information of an organism. This
information, stored in nucleic acids, the deoxiribonucleic acid (DNA), and the ribonucleic acid
(RNA), contains the master blueprint for all cellular structures and activities.
Nucleic acids are complex molecules. Their constituents, called nucleotides or bases, are
simpler molecules that we can divide in two groups: (i) purines, that is, adenine, usually denoted
a, and citosine (c); (ii) pyrimidines, that is, guanine (g) and timine (t). Nucleotides bind to form
long strands called polynucleotides, which are essentially polymers 1 in which nucleotides are
linked to each other. A DNA molecule is made of two strands wrapped around each other to
form a double helix. The double helix spatial configuration is due to weak hydrogenous bonds
that are established between nucleotides on the two strands. This bonds, often referred to as
base pairings, respond to a precise complementarity relationship according to which a only pairs
with t and c only pairs with g. The bond is directional, that is, the DNA molecule has a head
(usually called the 5′ end) and a tail (called the 3′ end), and one of the strands is called the direct
strand and the other the complementary strand.

Example 1. The term DNA sequence denotes the arrangement of nucleotides in each of the
two strands of a DNA molecule. Such sequences are naturally represented as strings over the
alphabet {a, c, g, t}.
For example, the following is a representation of a DNA fragment.
... a c t t g c c a a ...
5’ | | | | | | | | | 3’
... t g a a c g g t t ...

The high grade of redundancy, which makes possible to produce high-fidelity copies of DNA
molecules, is the very base of the replication process occurring in living cells.
In fact, the genetic information of an organism is stored in one or more long DNA molecules,
called chromosomes. For example, in human beings, each cell contains in its nucleus 23 pairs
of chromosomes. As long as the organism grows, that is, its cells duplicates, this information
is passed on to new cells. In particular, whenever a cell duplicates, each of the daughter cells
receives a copy of the chromosome set of the original cell. In order to do this, the cell first
replicates its whole genome. During such replication, hydrogen bonds in DNA molecules break
and strands separate; then, free nucleotides in the nucleus are used to build a new complementary
strand for each of the original strands, thus originating two identical copies of the cell genome.
In this way, each cell in the living organism has in its nucleus the whole genomic information,
and can use it to perform its functions.
The means by which chromosomes affect the structure and the functions of living cells are
proteins, that can be considered a primary component of living things. From the structural point
of view, proteins are polymers themselves, made of basic constituents called amino acids. There
are 20 different amino acids in nature. Amino acids form peptide bonds to create long chains
called polypeptides; the sequence of amino acids that form a protein is called its primary structure,
usually expressed as a string over a conventional 20-symbol alphabet {𝐴, 𝐶, 𝐾, 𝑀, 𝑁, . . .}.
1
A polymer is a long sequence of similar bonded elements.
� Indeed, the primary role of nucleic acids is to carry the encoding of the structure of specific
proteins. We call a gene each portion of chromosome containing information about the primary
structure of a protein. It is worth noting that only about 10% of the whole human genome is
known to contain protein-coding regions. The remaining part consists of noncoding regions –
such as control regions – and other completely meaningless regions.
Roughly speaking, we can say that there is one gene for each specific protein in a living
organism. There are, in fact, at least 100.000 different genes in humans. A simple and powerful
code is used for storing such information. In particular, to each non-overlapping triplet of
nucleotides in a gene, called a codon, corresponds a particular amino acid. Four distinguished
codons are used as stop codons, that is, to signal the end of the coding sequence. It is easy to
note that also in this case the code is redundant, since there are 43 different codons and only 20
amino acids (plus the stop signal).
When a protein is syntesyzed in a living cell, the DNA fragment representing the correspond-
ing gene undergoes several transformations.
• the DNA sequence is first transcribed into an RNA sequence; RNA is a nucleic acid whose
structure is very similar to the one of DNA; the only difference is that a new nucleotide,
called uracil (u), takes the place of timine (t); RNA molecules are single-stranded, so that
each RNA fragment can be easily represented as a string over the alphabet {a, c, g, u};
when the transcription of a DNA fragment occurs, the two strands separates and one
of them is used as a template to build a RNA strand, according to a slightly different
complementarity relationship (the only difference is that each a is transcribed to a u);
• then, the RNA molecule travels out of the nucleus, and undergoes a splicing process;
in fact, genes are not necessarily contiguous inside chromosomes; if we consider the
chromosome region containing a gene, there will be some gene regions that actually code
for proteins, called exons, and some interleaving regions, called introns, which do not
contain useful information; when DNA is expressed, introns need to be spliced out, so
that exons can be connected together and the whole coding region can be processed; such
splicing happens in specialized apparates called spliceosomes, which take a transcripted
RNA sequence and splice out regions corresponding to introns, connecting exons together;
the resulting molecule is called messenger RNA,
• the resulting RNA sequence is ready to be expressed; to do this, other apparates, called
transfer RNAs, translate the sequence by simply reading the sequence codon by codon
– that is, three symbols at a time – and, for each codon, appending the corresponding
amino acid to the protein. For example, whenever the codon aug is encountered, the
corresponding amino acid, methionine, is appended to the amino acid sequence. The four
stop codons, namely gct, gcc, gca and gcg, signal that this translation process has to
stop.

Example 2. As an example, consider the following DNA fragment representing a gene.
1 2 3 4 5 6 7 8 9 0 1 2 3 4
t a c g c c a a c t t a c t
5’ | | | | | | | | | | | | | | 3’
a t g c g g t t g a a t g a
�Suppose also that we know that the exons are from base 01 to base 06 and from base 09 to base
14. When the direct strand undergoes transcription, the following messenger RNA is produced.
a u g c g g u u g a a u g a

After splicing, the two exons are spliced together, so that the sequence to translate and the
resulting protein are the following.
a u g c g g g a a u g a

Here, met, arg and val stand for the methionine (M), arginine (A) and glutamic acid (G) aminoacids,
respectively.

3. The Human Genome Project
The final goal of the Human Genome project is to determine the exact sequence of the whole
human genome, i.e. of every gene present in the human race, and the corresponding functions.
To give a measure of the effort, it is worth noting that human genomes are about 3 billion base
long. If compiled in books, the data would fill an estimated 200 volumes, each the size of a 1000
pages telephone book, and reading it would require about 30 years working around the clock.
The achievement of such an ambitious goal is unfortunately very far away for now, as
the current experimental technology does not allow to read sequences longer than a few
hundred bases. In fact, time and cost considerations make large-scale sequencing projects
totally impractical. Consider that the smallest human chromosome (chromosome Y) is 50
million bases. The best available equipment can sequence only 50.000 to 100.000 bases per
year at an approximate cost of $1 to $2 per base. At this rate, an incceptable amount of 30.000
work-years and at least $3 billion would be required for sequencing alone. It is apparent how
the study of new sequencing technologies is a major concern in genome research these days.
Such a strong limitation imposes complex experimental protocols to reduce long chromosomes
into manageable fragments. On the other side, the study of chromosomes is usually focused on
different kinds of information rather than on the actual sequence. The objective is to build maps
of chromosomes, that is, to find the approximate position of relevant sites on the chromosomes,
corresponding to functionally important regions.
Due to the enormous size of human genomes and to the difficulty to perform experiments on
human beings, much of the research work is performed on other species, as mice, insects, yeasts
and bacteria. The following table gives an idea of the size of different genomes studied so far.

Comparative Sequence Sizes
Yeast Chromosome 3 350.000 bases
Escherichia Coli (bacterium) genome 4.6 million bases
Entire yeast genome 15 million bases
Smallest human chromosome (Y) 50 million bases
Largest human chromosome (1) 250 million bases
Entire human genome 3 billion bases
�4. Genome Databases
The huge amount of data generated by genome research will be used as a primary information
source for human biology and medicine into the future. A main issue is therefore the manage-
ment and distribution of genome data. Experimental information, sequences, maps need to be
collected and stored in order to be fully accessible to the research community.
Clearly, a strong database technology is necessary to provide full support for these activities.
Unfortunately, traditional DBMS have shown serious shortcomings in terms of genome data
management [9].
The first category of limitations is concerned with the data model. In fact, DNA data requires
a flexible sequence data type to be properly represented. Relational database systems do not
provide such a type, so that the only way to implement a DNA sequence type is to store
sequences as text strings. This technique, anyway, do not seem to be completely satisfactory,
since, on one side a 8-bit per base representation is a clear waste of disk space - DNA sequences
can be easily represented in a compress way by using only 2 bits per base; on the other side,
manipulation of text fields, when present in commercial systems, is usually associated with
serious limitations in terms of access. For example, indexing is usually not provided, and
operators such as substring and concatenation need to be implemented by the user.
Indeed, these limitations have made the role of relational DBMSs a very poor one in genome
research. As an example, consider the world most famous sequence database, GenBank [3].
GenBank is the NIH genetic sequence database, a collection of all known DNA sequences. There
are approximately 230 million bases and 238.000 sequences as of December 1994. It represents a
fundamental reference for the research community, since often sequenced DNA regions are
required to be submitted to GenBank for electronic publication even before they are published in
printed form. Information in GenBank is organized in records called entries; each entry contain
a sequence along with some annotations, which provide details about the source organism, gene
or DNA region, and the internal structure of the reported sequence, i.e. the number and position
of exons, introns and relevant control regions.
Although GenBank uses a relational DBMS for internal data manipulation, external access to
the database is not possible through a relational interface. In fact, the most common way to
access the database is to download a flat file, text-based dump of the current version, in which
all entries are listed sequentially.
Other very popular databases have also adopted alternative models of representation. For
example, the ACeDB [19], the integrated nematode worm genome database, use a specialized
hierarchical file system to represent information, and the user can navigate the database by
following links from one record to the other. This model has proven very successful in the
genome community, and other researchers are implementing databases for other species based
on a similar architecture.
The lack of a unifying model for representing information is probably the first obstacle to be
solved on the way of a real integrated access to genome data. In fact, the different databases
spread in the world, containing informations about proteins, maps, metabolic pathways, speak
different languages in terms of data models and data access facilities, so that today is not possible
to ask for biologically relevant queries that involve data contained in different databases.
�5. Query languages for genomic data
The other important cathegory of limitations of current database systems is related to query
languages. The very rich structure of genomic sequences makes querying a critical activity. We
can say that there are two different aspects of queries over sequences that have to be taken into
account.
On one side, an expressive query language should provide powerful pattern matching capa-
bilities. In fact, testing similarities of newly sequenced sequences against existing databases is a
fundamental activity. Right now, a researcher willing to test the degree of omology of a new
sequence against a database needs to recur to dedicated packages. In particular, there are some
well known query servers, for example BLAST [1] or FASTA [16] which can be invoked through
the Internet to test local alignment of short sequences against GenBank. Unfortunately, such
specialized resources do not have a flexible query interface, so that, for example, one cannot
specify queries such as “test the similarities of sequence X against any sequence in the database in
which the organism field equals human”, in which additional conditions are specified to narrow
the range of interesting sequences and speed up the search.
To solve these problems, in the literature, several pattern languages have been proposed
(e.g. [12, 18]). These languages usually provide a flexible syntax for defining patterns, that is,
expressions that correspond to a set of strings over a given alphabet. In principle, one can think
to enrich a relational query language, say SQL, by means of an extended selection operator, that,
given a pattern, allows for extracting the set of strings in the database that match the pattern.
Although the resulting language might represent a good solution to many sequence analysis
problems, it would still be too weak in terms of data restructurings.
In fact, restructurings also play a central role. The raw sequence data usually provides little
insight about the structure and functions of the corresponding DNA. For example, given a
DNA sequence, one would like to be capable of finding all the regions of interest inside the
sequence, that is, (i) exons; (ii) introns; (iii) important control regions responding to particular
patterns or consensus sequences. More important, all these conceptually derived data should be
accessible exactly as the base, stored data is, that is, the language should provide a flexible view
definition mechanism to restructure information in the database. As an example, a simple way to
represent exons and introns inside a given sequence is to identify them by means of the starting
and ending position inside the sequence, but then, a primitive to retrieve the corresponding
subsequence is necessary.
Moreover, sequences often need to be transformed. For example, it should be possible to find
the reverse complement of a given sequence, or, given the set of exons of a gene inside a DNA
sequence, splice exons together and translate them into the corresponding protein primary
structure. This means that some primitives for computing sequence mappings are needed,
potentially introducing new sequences in the database.
On the basis of the these considerations, we can briefly sketch a set of important requirements
for a genome sequence query language as follows:

• the language should be expressive, that is, able to capture the complexity of genomic data,
both in terms of pattern extraction and sequence restructurings;
• the language has to be declarative and easy to use; it should have a clear semantics and a
� nice syntax, and a suitable view definition mechanism, so that it can eventually become
an effective tool for biologists;
• last but not least, the language should be safe in the sense that queries should always
terminate, and the semantics be tractable; note, in fact, that queries on sequences may
end up in nonterminating computations even when the base alphabet is finite; in fact, by
growing in lenght, sequences can easily become infinite.

It is apparent how a good query language is the result of a good compromise between expressive
power and effective computability. Various proposals have appeared in the recent literature about
languages for querying sequences. Indeed, in most cases the trade-off between expressiveness,
finiteness and effective computability is a hard one. In many cases [8], powerful logics for
expressing sequence transformations are proposed, but a great part of the expressive power has
to be sacrificed to enforce finiteness. In other cases [11], both expressiveness and finiteness are
achieved, but at expense of an effective procedure for evaluating of queries, i.e., at the expense
of an operational semantics.

References
[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman. Basic Local Alignment
Search Tool. Jour. of Mol. Biology, 215:403–410, 1990.
[2] M. Atkinson, F. Bancilhon, D. DeWitt, K. Dittrich, D. Maier and Z. Zdonik. The object
oriented database manifesto. In 1st Intl. Conference on Deductive and Object Oriented
Databases (DOOD), Kyoto, 1989.
[3] C. Burks, M. Cassidy, M. J. Cinkosky, K. E. Cumella, P. Gilna, J. Hayden, G. M. Keen, T. A. Kel-
ley, M. Kelly, D. Kristofferson and J. Ryals. GenBank. Nucleic Acids Res., 19(Suppl.):2221–
2225, 1991.
[4] S. Ceri, G. Gottlob, and L. Tanca. Logic Programming and Data Bases. Springer-Verlag,
1989.
[5] W. Clocksin, C. Mellish. Programming in Prolog. Springer-Verlag, 1981.
[6] J. Collado Vides. The search for a grammatical theory of gene regulation is formally
justified by showing the inadequacy of context–free grammars. Computer Appications in
the Biosciences, 7(3), 1991.
[7] Department of Energy. Primer on Molecular Genetics, 1994.
[8] S. Ginsburg and X. Wang. Pattern matching by RS-Operations: Towards a unified approach
to querying sequenced data. In 11th ACM SIGMOD Symp. on Principles of Database Systems
(PODS), 1992.
[9] N. Goodman, S. Rozen and L. Stein. A glimpse at the DBMS Challenges Posed by the
Human Genome Project. Whitehead Institute for Biomedical Research, 1994.
[10] N. Goodman, S. Rozen and L. Stein. Requirements for a Deductive Query Language in the
MapBase Genome Mapping Database. In Workshop on Programming with Logic Databases
(following ILPS ’93), 1993.
[11] G. Grahne, M. Nykanen and E. Ukkonen. Reasoning about strings in databases. In 13th
ACM SIGMOD Symp. on Principles of Database Systems (PODS), 1994.
�[12] C. Hegelsen and P. R. Sibbald. PALM – A pattern language for molecular biology. In Proc.
First Int. Conference on Intelligent Systems for Molecular Biology, 1993.
[13] J. E. Hopcroft and J. D. Ullman. Introduction to automata theory, languages and computation.
Addison-Wesley, 1979.
[14] L. Hunter. Molecular Biology for computer scientists. In L.Hunter, editor. Artificial Intelli-
gence and Molecular Biology, AAAI Press, 1994.
[15] G. Mecca and A. J. Bonner. Sequences, Datalog and Transducers. To appear in 14th ACM
SIGMOD Symp. on Principles of Database Systems (PODS), 1995.
[16] W. B. Pearson and D. Lipman. Improved tools for biological sequence comparison. In Proc.
Nat. Academy of Science, 85(8):2444–2448, 1988.
[17] D. B. Searls. Representing genetic information with formal grammars. In Proc. Nat. Conf.
of the American Association for Artificial Intelligence, 7:386–391, 1988.
[18] D. B. Searls. Investigating the linguistic of DNA with Definite Clause Grammars. In Proc.
North Am. Conference on Logic Programming, pages 189–208, 1989.
[19] J. Tierry-Mieg and R. Durbin. ACeDB: A database for genomic information. In Proceedings
of the Genome Mapping and Sequencing Workshop, Cold Spring Harbor Laboratory, New
York, 1992.
�

Querying Genome Databases[edit]

Querying Genome Databases
Extended Abstract

Giansalvatore Mecca1
1
Dipartimento di Matematica, Informatica ed Economia, Universitá della Basilicata

After splicing, the two exons are spliced together, so that the sequence to translate and the
resulting protein are the following.
a u g c g g g a a u g a

Here, met, arg and val stand for the methionine (M), arginine (A) and glutamic acid (G) aminoacids,
respectively.