This paper is intended as a review of molecular biology databases and other Bioinformatics resources available
for biotechnologists aiming to use the wealth of genomic data available today. The genomic data along with
associated proteomic and functional data are often distributed across multiple databases, requiring a time-consuming search by the user. The explosion of information seen in molecular biology has created a veritable maze,
through which careful navigation is required for research and innovation in biotechnology. This paper, one of a
series, introduces readers to the major molecular biology databases and bioinformatics tools such as BLAST for
similarity searching and RasMol for protein structure visualization. Subsequent papers will take readers on a
journey through bioinformatics and the biotechnological discoveries that bioinformatics is enabling. Advances
in computer technologies and the birth of the internet are also part of this revolution in biology. Online databases
have given scientists and researchers across the world access to unimaginable volumes of biologically relevant data.
Bioinformatics, a truly multidisciplinary science, aims to use the benefits of computer technologies in understanding
the biology of life itself.
Keywords: bioinformatics, biological databases, alignment, Entrez, SRS, BLAST
1. Introduction
The announcement of the completion of a 'working
draft' of the human genome on June 26, 2000,
captured the imagination of people across the world in
a way that science and technology had not done since
man walked on the moon. Translating the 3 billion
characters in the DNA sequences that make up the
human genome into biologically meaningful
information has given rise to a new field: bioinformatics.
When the Human Genome Project
was conceived in 1987, the field of bioinformatics
was barely in its infancy. Today, the science of
bioinformatics has become a recognized discipline in
its own right, born out of the necessity to bring together
information sciences and the biological sciences in
understanding the wealth of data that has been created
through the various genomics, proteomics and
functional genomics projects around the world. This
paper is intended to introduce bioinformatics to
scientists and biotechnologists who are beginning to
explore and use the tools of bioinformatics in making
new advances and discoveries in the field of biology.
The first question today in the mind of many
scientists is "What is bioinformatics?" Bioinformatics
[Figure: workflow running from gene identification (BLAST searching of plant genome databases), through translation into protein products, to function identification (BLAST searching of protein databases, literature search) and pathway comparison for metabolites (searching pathways databases)]
Database Name               Link
                            http://www.ncbi.nlm.nih.gov/Genbank
                            http://www.ddbj.nig.ac.jp
                            http://pir.georgetown.edu
                            http://www.expasy.ch/sprot
                            http://www.ebi.ac.uk/interpro
MetaFam                     http://metafam.ahc.umn.edu/
Membrane Protein Database   http://biophys.bio.tuat.ac.jp/ohshima/database/
TRANSFAC                    http://transfac.gbf.de/TRANSFAC/
                            http://www.rcsb.org/pdb/
NCBI's MMDB
SCOP
HGBASE                      http://hgbase.cgr.ki.se/
OMIM                        http://www.ncbi.nlm.nih.gov/OMIM/
                            http://www.ncbi.nlm.nih.gov/GEO
                            http://genome-www4.stanford.edu
                            www.genoplante.org
UK CropNet                  http://ukcrop.net/
Database Name               Link
                            http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
                            http://microbialgenome.org/
                            http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
                            http://wit.integratedgenomics.com/GOLD/
                            http://www.fruitfly.org
                            http://133.11.149.55/
                            http://www.informatics.jax.org/
                            http://genome-www.stanford.edu/Saccharomyces/
                            & http://rgp.dna.affrc.go.jp/
                            http://zmdb.iastate.edu
Specialized Resources
a) Plant Databases. The completion of the
sequencing of the entire genome of the model plant
Arabidopsis thaliana (Arabidopsis Genome Initiative,
2000) has been hailed by plant biotechnologists as the
beginning of a new era. Various efforts to sequence
the genomes of major crop plants are underway and
will be completed shortly. Scientists are now faced
with the challenge of identifying new plant genes,
understanding the functions of newly discovered plant
[Figure: TIGR Comprehensive Microbial Resource]
Fig. 3--Radiation resistance in D. radiodurans [plot comparing D. radiodurans and E. coli; x-axis: Radiation (kGy)]
[Figure: pairwise sequence alignments; (a) P00090 aligned with P00001, (b) P13569 aligned with P33593]
exact match searches, and only then is Smith-Waterman
invoked. This approach permits FASTA or
BLAST to run 10 to 100 times faster than
conventional Smith-Waterman, at the cost of missing
a few alignments. Some of the adjustable parameters
described below give the user the flexibility to
trade off speed against accuracy. BLAST, in
general, tends to be faster and more sensitive
(it detects more alignments), but FASTA returns fewer
false hits.
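The seed-then-align idea described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the actual BLAST or FASTA implementation: the function names, the word length k=3, and the match/mismatch/gap scores are all assumptions chosen for readability, and real tools restrict the alignment to the neighbourhood of the seeds rather than aligning whole sequences.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map every length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = kmer_index(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a simple linear gap cost (toy scores)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def heuristic_search(query, database, k=3):
    """Run the expensive alignment only on sequences that share an exact word."""
    hits = {}
    for name, subject in database.items():
        if find_seeds(query, subject, k):  # cheap exact-match filter
            hits[name] = smith_waterman_score(query, subject)
    return hits
```

Sequences with no shared word are never aligned, which is where both the speed-up and the occasional missed alignment come from. Raising k makes the filter faster but stricter.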
Search Parameters. The effectiveness of alignment
algorithms depends on their parameters: a careful choice
is necessary, without which important alignments may
be overlooked or too many spurious alignments may
be returned. There are three sets of parameters that
can be specified by the user to control the results: the
alignment parameters, algorithmic parameters, and
output parameters.
=> The alignment parameters include the choice of
substitution matrix and the costs associated
with gaps. The substitution matrix gives the cost
associated with substituting one residue for
another in aligning two protein sequences. The
most popular substitution matrices are the
PAM (Schwartz & Dayhoff, 1978) and
BLOSUM (Henikoff & Henikoff, 1992) families
of matrices. The gap cost parameters involve a
cost associated with opening a gap and a lesser
cost associated with extending a gap.
=> The different algorithmic parameters mostly
control the heuristics on which BLAST and
FASTA rely and hence allow the user to
control the speed and accuracy of their
alignments. We refer the reader to the online
manuals available at the BLAST and FASTA
sites referred to above for a detailed description
of these algorithmic parameters.
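To make the alignment parameters concrete, the sketch below scores a given gapped alignment using a substitution matrix together with separate gap-opening and gap-extension costs. The matrix values and the default costs (10 to open, 1 to extend) are invented for illustration and are not taken from PAM, BLOSUM, or the defaults of any program; the function names are likewise hypothetical.

```python
# Toy substitution scores for three residues; illustrative values only.
TOY_MATRIX = {
    ("A", "A"): 5, ("G", "G"): 5, ("L", "L"): 5,
    ("A", "G"): 1, ("A", "L"): -3, ("G", "L"): -3,
}


def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of one gap: gap_open for its first position, gap_extend for each further one."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def score_alignment(row_a, row_b, matrix, gap_open=10, gap_extend=1):
    """Score two equal-length aligned rows in which '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_run = 0      # length of the gap currently being read
    gap_row = None   # which row the current gap sits in
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if gap_run and row != gap_row:
                # A gap in the other row starts: charge the finished gap first.
                score -= affine_gap_cost(gap_run, gap_open, gap_extend)
                gap_run = 0
            gap_row = row
            gap_run += 1
        else:
            score -= affine_gap_cost(gap_run, gap_open, gap_extend)
            gap_run = 0
            # The matrix is symmetric, so look the pair up in either order.
            score += matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]
    score -= affine_gap_cost(gap_run, gap_open, gap_extend)
    return score
```

Because extension is cheaper than opening, one gap of length two costs less than two separate gaps of length one, which is the behaviour the gap cost parameters are meant to produce.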
BLAST program   FASTA program   Input Sequence            Database Sequence
BLASTn          FASTA           Nucleotide                Nucleotide
BLASTp          FASTA           Protein                   Protein
BLASTx          FASTx           Nucleotide (translated)   Protein
TBLASTn         tFASTx          Protein                   Nucleotide (translated)
TBLASTx         tFASTx          Nucleotide (translated)   Nucleotide (translated)
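The program and sequence-type combinations tabulated above can be encoded as a small lookup, shown here as a sketch. The dictionary reflects one reading of the table (input sequence type paired with database sequence type), and the function name and the lower-case type labels are assumptions made for this example.

```python
# (BLAST program, FASTA program) keyed by (input sequence type, database sequence type).
PROGRAMS = {
    ("nucleotide", "nucleotide"): ("BLASTn", "FASTA"),
    ("protein", "protein"): ("BLASTp", "FASTA"),
    ("nucleotide (translated)", "protein"): ("BLASTx", "FASTx"),
    ("protein", "nucleotide (translated)"): ("TBLASTn", "tFASTx"),
    ("nucleotide (translated)", "nucleotide (translated)"): ("TBLASTx", "tFASTx"),
}


def choose_program(input_type, database_type):
    """Return the (BLAST, FASTA) program pair for a query/database combination."""
    key = (input_type.strip().lower(), database_type.strip().lower())
    if key not in PROGRAMS:
        raise ValueError(f"no program listed for {input_type!r} against {database_type!r}")
    return PROGRAMS[key]
```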
Integration of Data
==> A basic problem underlying the integration of
heterogeneous databases is the autonomy of the
sources, which has led to a lack of cooperation
and non-standardization of formats, with some
notable exceptions. For example, GenBank,
EMBL and the DNA Data Bank of Japan
(DDBJ) cooperate in creating a centralized
repository for the human genome sequence
data, and data are exchanged daily for the
purpose of synchronization. A cooperation/co-licensing
agreement is the first step in the
creation of integrated systems.
==> As data is exchanged between heterogeneous
systems, schema converters are required
(which convert data from one schema to either
the schema of another database or a global
schema). General schema integration
methodologies have been discussed (Batini et
al, 1986) and further evaluated in the context
of biological databases by Buneman et al
(1995).
==> Data conflicts and errors need to be resolved in
a systematic manner.
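A schema converter of the kind mentioned above can be as simple as a per-source mapping from local field names to the fields of a global schema. The sketch below assumes two hypothetical sources, "db_a" and "db_b", with invented field names; it does not reproduce the actual record format of GenBank, EMBL, DDBJ, or any other database.

```python
# Hypothetical local field names for two sources, mapped onto one global schema.
SOURCE_TO_GLOBAL = {
    "db_a": {"acc": "accession", "desc": "description", "seq": "sequence"},
    "db_b": {"ID": "accession", "DE": "description", "SQ": "sequence"},
}


def to_global_schema(record, source):
    """Rename a record's fields from a source's local schema to the global schema."""
    mapping = SOURCE_TO_GLOBAL[source]
    converted = {"source": source}  # remember where the record came from
    for local_name, value in record.items():
        if local_name in mapping:   # fields without a global counterpart are dropped
            converted[mapping[local_name]] = value
    return converted
```

Keeping the mapping as data rather than code means that adding a further source only requires a new dictionary entry, although real converters must also reconcile units, identifiers, and nested structures.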
Integration of User Interfaces
==> Browsing Interface: A unified view
of the data is presented in one of two possible ways
(Markowitz & Ritter, 1995): (a) a global
schema is created by unifying the schemas of
the component databases; or (b) local views of
the data are provided, using the "local" schema of the
component databases.
==> Query Interface: Each of the component
databases may support different types of
Data Warehousing
Data Warehouses represent the materialization of a
global schema, i.e. the integrated database is defined
by the global schema and loaded with data from the
component databases. The steps involved in creating a
data warehouse are:
=> Downloading of data from the component databases
=> Data cleaning (removal of erroneous entries and resolving data conflicts)
=> Reformatting the data into the global schema
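The three steps above can be sketched as a small pipeline. Everything specific in it is an assumption made for illustration: the record fields (accession, sequence, updated), the rule that a sequence containing characters outside A, C, G, T, N is erroneous, and the rule that conflicts are resolved in favour of the most recently updated record. The component databases are passed in as in-memory dictionaries standing in for downloaded dumps.

```python
VALID_RESIDUES = set("ACGTN")  # assumed alphabet for nucleotide entries


def download(component_databases):
    """Step 1: gather (source, record) pairs from every component database."""
    for source, records in component_databases.items():
        for record in records:
            yield source, record


def clean(pairs):
    """Step 2: drop erroneous entries and resolve conflicts between sources."""
    kept = {}
    for source, record in pairs:
        seq = record.get("sequence", "").upper()
        if not seq or set(seq) - VALID_RESIDUES:
            continue  # erroneous entry: empty or contains invalid characters
        acc = record["accession"]
        current = kept.get(acc)
        # Conflict rule (an assumption): keep the most recently updated version.
        if current is None or record["updated"] > current[1]["updated"]:
            kept[acc] = (source, dict(record, sequence=seq))  # copy, never mutate
    return kept


def load(kept):
    """Step 3: reformat the surviving records into the global schema."""
    return [
        {"accession": acc, "sequence": rec["sequence"],
         "origin": source, "updated": rec["updated"]}
        for acc, (source, rec) in sorted(kept.items())
    ]


def build_warehouse(component_databases):
    """Materialize the global schema from the component databases."""
    return load(clean(download(component_databases)))
```

The records are copied rather than modified, so the component databases are left untouched by the build.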
A data warehouse is often confused with a
consolidation of multiple databases. In consolidating
multiple databases, the component databases are
subsumed into a larger database and the individual
component databases are discarded, whereas in data
warehousing, the component databases are not
disturbed. Consolidation is far more complex and
expensive, requiring consensus on common names,
data structures, and policies. Furthermore, existing
applications on component databases must be
converted in order to function on the consolidated
system. The relative advantages and disadvantages of
data warehousing systems are listed in Table 3.
Current data warehousing systems used in molecular
biology databases are:
[Table 3: Advantages and disadvantages of data warehousing systems]
GUS--Genomics Unified Schema--Genomics
[Figure: diagram of linked resources, including OMIM, full-text electronic journals, and maps and genomes]
[Figure: query form showing the query "[EMBL-ID:AB034639]" with selectable databanks: MEDLINE; sequence libraries SWALL (SPTR), SWISSPROT, SPTREMBL, REMTREMBL, PATENT PRT, JPO PRT, USPO PRT, PATENT DNA, ENSEMBL, IMGT/LIGM-DB, IMGT/HLA]
Semantic Integration
The ultimate aim of the LinkDB systems is to supplement
their current hyperlinks with links based on biological meanings
and relations. For example, known protein-protein
interactions should be represented through a bi-directional hyperlink between the two protein
sequence entries. However, this kind of semantic
integration is still in its infancy and is the focus of much
bioinformatics research. Efforts on semantic
integration worth mentioning are the development of
XML ontologies and some early work on TAMBIS.
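A bi-directional link of the kind described above can be sketched as an index that records every relation in both directions at once. The class below is purely illustrative and is not how LinkDB is implemented; the class name, the method names, and the "interacts_with" relation label are all assumptions.

```python
from collections import defaultdict


class LinkIndex:
    """Stores typed links between database entries, always in both directions."""

    def __init__(self):
        self._links = defaultdict(set)

    def add_interaction(self, entry_a, entry_b, relation="interacts_with"):
        """Record a symmetric relation so it can be followed from either entry."""
        self._links[entry_a].add((relation, entry_b))
        self._links[entry_b].add((relation, entry_a))

    def linked_entries(self, entry, relation=None):
        """Entries linked to the given one, optionally filtered by relation type."""
        return sorted(
            target
            for rel, target in self._links.get(entry, ())
            if relation is None or rel == relation
        )
```

Carrying a relation label on each link is what distinguishes this from a plain hyperlink: a query can ask for interaction partners specifically instead of for anything that happens to be cross-referenced.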
XML Ontologies. In a discipline-wide effort to
standardize the representation of entries in biological
databases, various organizations and institutions are
participating in the creation of XML ontologies.
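As a rough illustration of what a standardized XML representation of a database entry involves, the sketch below writes and reads a minimal entry using Python's standard xml.etree.ElementTree module. The element and attribute names (entry, accession, description, sequence, length) are invented for this example and do not follow any of the markup languages being developed by the groups mentioned above.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(accession, description, sequence):
    """Serialize one entry as an XML string using hypothetical tag names."""
    entry = ET.Element("entry", attrib={"accession": accession})
    ET.SubElement(entry, "description").text = description
    seq = ET.SubElement(entry, "sequence", attrib={"length": str(len(sequence))})
    seq.text = sequence
    return ET.tostring(entry, encoding="unicode")


def xml_to_entry(xml_text):
    """Parse the XML produced by entry_to_xml back into a dictionary."""
    entry = ET.fromstring(xml_text)
    return {
        "accession": entry.get("accession"),
        "description": entry.findtext("description"),
        "sequence": entry.findtext("sequence"),
    }
```

Because every field is explicitly tagged, a program that has never seen the source database can still locate the sequence or the description, provided both sides agree on the vocabulary. Agreeing on that vocabulary is the purpose of the ontology efforts.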
XML: Better than HTML (the language used in
the creation of web pages), XML (eXtensible
Markup Language) has emerged as the de facto
Description
AnatML
CellML
Clinical Trial Data Model
Gene Expression Markup Language (GEML)
GeneOntology Markup Language