Ijbt 1 (1) 101-116

Indian Journal of Biotechnology
Vol 1, January 2002, pp 101-116
Bioinformatics: Advancing Biotechnology through Information Technology Part I:

Molecular Biology Databases
Sudeshna Adak* and Biplav Srivastava
IBM India Research Lab, Block I, IIT Campus, Hauz Khas, New Delhi 110016, India
This paper is intended as a review of molecular biology databases and other Bioinformatics resources available
for biotechnologists aiming to use the wealth of genomic data available today. The genomic data along with
associated proteomic and functional data are often distributed across multiple databases, requiring a time-consuming search by the user. The explosion of information seen in molecular biology has created a veritable maze,
through which careful navigation is required for research and innovation in biotechnology. The paper, one of a
series, introduces the readers to the major molecular biology databases and bioinformatics tools such as BLAST for
similarity searching and RasMol for protein structure visualization. Subsequent papers will take the readers on a
journey across bioinformatics and the biotechnological discoveries that are happening with bioinformatics. Advances
in computer technologies and the birth of the internet are also part of this revolution in biology. Online databases
have given scientists and researchers across the world access to unimaginable volumes of biologically relevant data.
Bioinformatics, a truly multidisciplinary science, aims to use the benefits of computer technologies in understanding
the biology of life itself.
Keywords: bioinformatics, biological databases, alignment, Entrez, SRS, BLAST
1. Introduction
The announcement of the completion of a 'working
draft' of the human genome on June 26, 2000,
captured the imagination of people across the world in
a way that science and technology had not done since
man walked on the moon. Translating the 3 billion
characters in the DNA sequences that make up the
human genome into biologically meaningful
information has given rise to a new field: bioinformatics. When the Human Genome Project
was conceived of in 1987, the field of bioinformatics
was barely in its infancy. Today, the science of
bioinformatics has become a recognized discipline on
its own - born out of the necessity to bring together
information sciences and the biological sciences in
understanding the wealth of data that has been created
through the various genomics, proteomics and
functional genomics projects around the world. This
paper is intended to introduce bioinformatics to
scientists and biotechnologists who are beginning to
explore and use the tools of bioinformatics in making
new advances and discoveries in the field of biology.
The first question today in the mind of many
scientists is "What is bioinformatics?" Bioinformatics
has been touted as in-silico biology, where wet lab
experimental biology can (perhaps) be replaced with
computers. A more precise definition of
bioinformatics is the application of information
sciences (mathematics, statistics and computer
science) to increase our understanding of biology.
Probably the most remarkable success of
bioinformatics to date has been its use in the 'shotgun
sequencing' of the human genome.
*Author for correspondence:
Tel: 91-11-6861100
E-mail: asudesn@in.ibm.com
In shotgun sequencing (Bankier et al, 1987), a large
piece of DNA is broken up randomly into smaller
fragments. The smaller fragments are subcloned, their
ends sequenced, and the fragments are reassembled
based on overlaps (Fig. 1). This approach rapidly
reveals 90% of the desired sequence information, and
the remaining few gaps are filled by custom
oligonucleotide primers (Waterston & Sulston, 1995).
Sequence assembly is still one of the primary uses of
bioinformatics in the various sequencing projects
underway today.
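The reassembly of fragments based on overlaps can be illustrated with a minimal sketch. It assumes error-free reads taken from a single strand and uses a simple greedy merge; the function names and the `min_len` parameter are illustrative choices made here, and production assemblers are considerably more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    unique = sorted(set(fragments))  # sorted for reproducible tie-breaking
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: remaining contigs are separated by gaps
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three overlapping reads from the toy sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

When no overlap of at least `min_len` remains, the function returns several contigs. This corresponds to the gaps that, as described above, are closed with custom primers.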
Bioinformatics, as a subject, consists of three core
areas:
Molecular Biology Databases
Sequence Comparison and Sequence Analysis
The Emerging Technology of Microarrays
For detailed discussion on these topics, the readers
are referred to Baxevanis & Ouellette (1998); Gibas &
Jambeck (2001); Rashidi & Buehler (1999); Mount
(2001); and Misener & Krawetz (2000). This paper, the
first in a series of three, provides a review of the first core
area of molecular biology databases. The subsequent
papers will review the two other core areas of
bioinformatics. The review of molecular biology
databases is intended to provide the answers to two
kinds of questions faced by biotechnologists in
using bioinformatics:
1 What are the resources currently available and
what is their potential usefulness in
biotechnology? (Sections 2 and 3 of this paper
provide a review of these resources).
2 What are the important aspects of database
technology that are particularly relevant in
creating new biological databases and adding
value to existing biological databases? (Section
4 of this paper discusses the main value add in
molecular biology databases today: seamless
integration of multiple, heterogeneous
databases that provides the user a single point
of entry to a variety of resources).
While sequencing of the human genome has
captured the attention of the scientific
community at large, similar efforts for other
organisms and the creation of databases in
related areas of proteomics and functional
genomics have not received the same publicity.
However, it is the combined use of a variety of
biological databases that will have an impact
on every industry that uses biotechnology
today, such as pharmaceuticals, agriculture,
forensics, bioremediation and biofuels, and
other biochemical industrial processes. It is
clear that an understanding of the combination
of resources of molecular biology databases is
necessary for the modern biotechnologist. A
hypothetical scenario illustrating the combined
use of such resources, leading to a
new biological discovery, is given below.
Use Case Scenario-A scientist at a plant
biotechnology company is interested in understanding
the genetic basis of fruit development; specifically,
the user is interested in identifying the genes that are
involved in ripening green strawberries into red
strawberries and in determining the biochemical
pathways involved in the ripening process. Fig. 2
illustrates a comparative genomics approach that can
be used in a purely in-silico effort to determine the
genes involved in ripening of strawberries. In this
approach, the genome of the strawberry fruit is
compared to the annotated genomes of similar
species, to identify the genes and their associated
functions.
2. Molecular Biology Databases
Most biologists and biotechnologists are familiar
with the more well-known nucleotide sequence
Fig. 1-Shotgun sequencing
Fig. 2-From strawberry genome to phenotype (colour: red and flavour: sweet). Panels: Gene Identification (BLAST searching of plant genome databases); Translation to protein products; Function Identification (BLAST searching of protein databases, literature search); Path Comparison (search pathways databases)
ADAK & SRIVASTAVA: BIOINFORMATICS-MOLECULAR BIOLOGY DATABASES
Table 1-Major Biological Databases
Database Name | Link | Contents
Molecular Biology Database Catalogs
Infobiogen Catalogue | http://www.infobiogen.fr/services/dbcat/ |
The Molecular Biology Database Collection | http://nar.oupjournals.org/cgi/content/full/29/1/1/DC1 |
The European Bioinformatics Institute Biocatalogue | http://www.ebi.ac.uk/biocat/ |
Major Biomedical Literature Databases
NCBI's PubMed | http://www.ncbi.nlm.nih.gov/PubMed | Medline and Pre-Medline citations
Major Nucleotide Sequence Databases
NCBI's Genbank | http://www.ncbi.nlm.nih.gov/Genbank | All known nucleotide and protein sequences: International Nucleotide Sequence Data Collaboration
EMBL Nucleotide Sequence Database | http://www.ebi.ac.uk/embl/ | All known nucleotide and protein sequences: International Nucleotide Sequence Data Collaboration
DNA Data Bank of Japan (DDBJ) | http://www.ddbj.nig.ac.jp | All known nucleotide and protein sequences: International Nucleotide Sequence Data Collaboration
Major Protein Databases
Protein Information Resource (PIR) | http://pir.georgetown.edu | Comprehensive, annotated, non-redundant protein sequence database
Swiss-Prot/TrEMBL | http://www.expasy.ch/sprot | Curated protein sequences
InterPro | http://www.ebi.ac.uk/interpro | Integrated resource for protein families, domains, and sites
MetaFam | http://metafam.ahc.umn.edu/ | Integrated protein family information
Membrane Protein Database | http://biophys.bio.tuat.ac.jp/ohshima/database/ | Membrane sequences, transmembrane regions and structures
TRANSFAC | http://transfac.gbf.de/TRANSFAC/ | Transcription factors and binding sites
Major Protein Structure Databases
Protein Data Bank (PDB) | http://www.rcsb.org/pdb/ | Structure data determined by X-ray crystallography and NMR
NCBI's MMDB | http://www.ncbi.nlm.nih.gov/Structure | All experimentally-determined 3D protein structures linked to NCBI's Entrez
SCOP | http://scop.mrc-lmb.cam.ac.uk/scop/ | Familial and structural protein relationships
Major Mutation Databases
NCBI's dbSNP | http://www.ncbi.nlm.nih.gov/SNP/ | Database of single nucleotide polymorphisms
ALFRED | http://alfred.med.yale.edu/alfred/index.as | Allele frequencies and DNA polymorphisms
HGBASE | http://hgbase.cgr.ki.se/ | Intragenic sequence polymorphisms
OMIM | http://www.ncbi.nlm.nih.gov/OMIM/ | Catalog of human genetic and genomic disorders
Major Gene Expression Databases
Gene Expression Omnibus | http://www.ncbi.nlm.nih.gov/GEO | NCBI's repository for gene expression (under development)
Stanford Microarray Database (SMD) | http://genome-www4.Stanford.edu | Gateway to microarray data from Stanford labs
Major Plant Genome Databases
Genoplante | www.genoplante.org | Genomics for plant improvement
UK CropNet | http://ukcrop.net/ | Comprehensive gateway to crop genomes
Major Microbial Genome Databases
NCBI's Microbial Genome Gateway | http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html |
DOE's Microbial Genomics Gateway | http://microbialgenome.org/ |
TIGR's Comprehensive Microbial Resources | http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl | Completed microbial genomes
Major Organism-specific Genome Databases
Genomes Online Database (GOLD) | http://wit.integratedgenomics.com/GOLD/ | Information regarding complete and ongoing genome projects
Flybase | http://www.fruitfly.org | Drosophila sequences and genomic information
Full-Malaria | http://133.11.149.55/ | Malaria full-length cDNA database
Mouse Genome Database (MGD) | http://www.informatics.jax.org/ | Mouse genetics and genomics
Arabidopsis thaliana genome database | http://www.tigr.org/tdb/e2k1/ath1/ | Arabidopsis thaliana genome database
Saccharomyces Genome Database (SGD) | http://genome-www.stanford.edu/Saccharomyces/ | S. cerevisiae genome information
Rice Genome Project | http://rgp.dna.affrc.go.jp/ | Reporting current data in the rice genome project
ZmDB | http://zmdb.iastate.edu | Reporting current data in the maize genome project
databases such as Genbank or EMBL as well as the

protein related databases of Swiss-Prot, PIR, and PDB
(Table 1). However, there are numerous specialized
biological databases that have been created out of a
particular need, either to answer a particular
biological question or to better serve a particular
segment of the biological community. The objective
in describing the molecular biology databases is to
better serve our readers in promoting the use of these
resources in the design and analysis of their
experiments. Comprehensive listings of molecular
biology databases are available through the following
web catalogues:
Infobiogen (Table 1) provides an online
catalogue of molecular biology databases. The
catalogue has a listing of 511 databases (as of
October 30, 2001), further categorized based
on content as DNA related: 87; RNA
related: 29; protein related: 94; genomic: 58;
mapping: 29; protein structure: 18; literature:
43; and miscellaneous: 153.
The Molecular Biology Database Collection
(Table 1) is an online catalogue of key
databases of value to the biological
community. This collection currently contains a
list of 281 high-quality online databases. The
databases included in this collection are
considered particularly relevant for
biotechnology research as they provide new
value to the underlying data by virtue of
curation, new data connections or other
innovative approaches.
Since 1993, the European Bioinformatics
Institute has been maintaining Biocatalogue
(Table 1). The Biocatalogue is a software
directory of general interest in molecular
biology and genetics. It is also categorized into
key areas similar to the Infobiogen catalogue.
It is beyond the scope of this paper to describe the
plethora of molecular biology databases available
today. The readers are referred to the Infobiogen
Catalogue and the Molecular Biology Database Collection for
a comprehensive listing. The major well-known
biological databases are listed in Table 1. Some
specialized databases which are perhaps of more
interest to biotechnologists and molecular biologists,
are listed below.
Specialized Resources
a) Plant Databases. The completion of the
sequencing of the entire genome of the model plant
Arabidopsis thaliana (Arabidopsis Genome Initiative,
2000) is hailed as the beginning of a new era by
plant biotechnologists. As various efforts in sequencing
the genomes of major crop plants are underway and
will be completed shortly, scientists are now faced
with the challenge of identifying new plant genes,
understanding the functions of newly discovered plant
genes and "reaping the plant gene harvest" by using
this new information in improving crop yield. While
plant genomics (i.e. unraveling the functions of plant
genes based on whole genome sequencing) is still in
its infancy, plant biotechnologists have been more
active in creating proteomics and metabolic profile
databases. Proteomics uses two-dimensional (2D) gel
electrophoresis to separate the proteins in a cell or
tissue by size and isoelectric point (pH characteristic), followed by mass
spectrometry to help identify each component of the
resulting gel pattern. Using this technology,
researchers are building protein expression pattern
databases for Arabidopsis, rice, maize, and pine trees
(see http://www.expasy.ch/ch2d/2d-index.html for
links to these databases). The databases provide
pictures of 2D gels with links to previous publications
on proteins of interest, the plant tissue from which they
were derived and sequence data for proteins. However,
as an increasing number of plant genomes become
available, plant genomics will play an even more
significant role in plant biotechnology. Two gateways
to the emerging area of plant genomics are reviewed.
UK CropNet-Comparative genomics (ascribing
function through alignment of similar nucleotide or
protein sequences) has become a key area of research
in plant biotechnology. This is because genomes of
closely related plant species have been found to have
remarkably similar genes and gene functions. As
vast amounts of plant genomic data become
available, the use of bioinformatics to improve plant
varieties is also becoming vital. To make sense of all
the genomic data, UK CropNet was established in
1996 with the specific aims of developing software and
databases that will facilitate the querying of genomic
information from different crop species. Particular
emphasis has been placed on developing software
tools for comparative mapping. UK CropNet has used
the AceDB (Durbin & Thierry-Mieg, 1992) database
system to create separate databases for each of the UK
CropNet projects, with individual databases for
Arabidopsis, Barley, Brassica, Forage grasses, and
Millet.
Genoplante-Genoplante is a remarkable instance
of collaboration between publicly funded institutions
and private organizations in furthering scientific
research in genomics and bioinformatics, specifically
for plants. Genoplante is a major partnership
programme in plant genomics, which links public
research in France (INRA, CIRAD, IRD, CNRS) and
the main private companies involved in crop
improvement and protection (Biogemma, Aventis
CropScience, Bioplante). Genoplante is part of the

fierce international competition in the science of plant
genomics, as can be seen from the variety of world
programmes which are being created. Genoplante is
the French answer - and tomorrow the European
answer - to this major scientific and economic
challenge. By pooling their knowledge and their
finances in a structured research network, the public
and private members of the programme are playing
the synergy card, whilst following their own
programmes at the same time. The objective of
Genoplante is to create a network of laboratories
across Europe, which will pool their combined
resources to discover new plant genes, study the
genomics of model plant species like Arabidopsis and
rice, and conduct genome-based research on major
crops under cultivation in Europe.
b) Microbial Databases. Micro-organisms (viruses,
bacteria, fungi, protozoa and algae) hold the key to
maintaining the earth's ecological system. The unique
properties of microbes represent an extremely
valuable resource for biotechnology and are key
elements in breakthroughs such as more effective and safer
vaccines, identification of new drug and chemical
targets in pathogens, improved industrial catalysts,
bioremediation, and perhaps clues to the origin of the
Earth. However, as there are thousands of microbes
known to exist on earth, systematic understanding of
the biology of such a large number of organisms was
a daunting task until recently (less than 1% of
microbes on earth have been cultured and studied in
the laboratory). Whole genome sequencing represents
an important step as it can help accelerate the process
of understanding a microbe's biological capabilities and lead to real impact in the field of
biotechnology. To date, there are 59 complete,
annotated microbial genomes, with 17 more whose
sequencing is complete and whose annotation is under
development. For an additional impressive number
(~200 to 300) of microbial genomes, sequencing is
currently in progress. Initial analysis of available
microbial genome data has already led to some
surprising results: 20-30% of genes encode unknown
proteins apparently unique to the species and 40-50%
of genes encode proteins of unknown function. Some
of the important databases that are gateways to
microbial genomes (see Table 1 for links) and will prove
a key resource in microbial biotechnology research
are as follows:
NCBI's Microbial Genomes Gateway-The gateway
provides access to complete and unfinished
microbial genomes, including sequence and structural
similarity searching tools. It also provides the
taxonomic information of the microbial species.
Microbial Genomics Gateway-This is the portal to
the US Department of Energy (DOE)'s Microbial
Genome Programme. It provides a comprehensive set
of links to web pages for microbes under study around
the world. The portal also provides links to DOE's
research on genetic engineering and biotechnology
using microbes. For example, DOE scientists who finished
sequencing the radiation-resistant bacterium (Fig. 3)
Deinococcus radiodurans (White et al, 1999) are now
investigating whether the genome of D. radiodurans can be
altered to increase its potential usefulness in cleaning
up toxic waste around the globe (Melin et al, 2001).
Through the use of biotechnological processes,
scientists hope to transfer genes from other organisms
that will enable the bacterium to degrade toxic
chemicals such as toluene found in mixed chemical
and radiation waste sites.
TIGR Comprehensive Microbial Resource-In
addition to the links to microbial genomes (complete,
annotated and sequencing in progress), TIGR
provides a set of bioinformatics tools (similarity
search and gene identification) that are geared
specifically for microbial genomes.
c) Mutation Databases. A key aspect of research in
genetic engineering is understanding how mutation
(variation in the DNA sequence) affects different
phenotypes (characteristics of the organisms). In
humans, many mutations are harmless but some are
not, and most diseases are associated with mutations.
Moreover, inherited mutations in humans are mostly
single nucleotide polymorphisms (SNPs), which
occur every 100 to 300 bases. Because of the
importance of identifying such mutations in the study
of the genetic basis of diseases, ten of the major
pharmaceutical companies in the world have come
together in an unprecedented scientific collaboration
to form the SNP consortium. The objective of the
SNP consortium is to create a database of all known
human SNPs. Here two of the major gateways to
human mutation databases are discussed. A complete
catalogue of human mutation databases is available at
http://www.uta.fi/laitokset/imt/bioinfo/BTKbase/database.html.
Interestingly, the effect of mutations in
crops is being actively studied by plant
biotechnologists, but there is currently no public
database available cataloguing such mutations. While
creation of human mutation databases involves
expensive and time-consuming screening, the effect
of plant mutations can be studied through
manipulation of plant genes. Biotechnology methods
currently being used to study mutations in plants
include:
Chimeraplasty: Creates SNPs in plant genes
(Zuo & Chua, 2000).
Trait utility system: Creates random mutations
in a large number of genes in fertile maize
plants by inserting DNA that can jump in and
out of genes; the resulting plants are
screened for interesting changes such as
drought resistance or sweeter kernels (Gura,
2000).
Activation Tagging: Generates wholesale
mutations in plants by inserting DNA
enhancers via a plant-cell infecting bacterium
(Weigel et al, 2000).
3. Applications for Molecular Biology Databases

Biological databases often provide bioinformatics
applications as part of the user interface. There are
three primary types of bioinformatics tools that are
commonly coupled with the databases: text-based
database searching, similarity-based database
searching, and visualization tools. The most
frequently used of these bioinformatics tools are:
Text-based Database Searching

Fig. 3-Radiation resistance in D. radiodurans [curves: D. radiodurans, E. coli; x-axis: Radiation (kGy)]
Most publicly available biological databases
provide a text-based search interface that allows
retrieval of entries that "match" user-specified
word(s) or phrases. Advanced search features
typically include boolean searches (combining terms
with and/or/not), wild cards, etc. The example given
below demonstrates the capabilities and limitations of
text-based searching in NCBI's PubMed database
(Table 1).
Use Case Scenario-The goal is to search PubMed
to determine possible functions of the yeast gene
MDM2. A query on PubMed for "yeast AND
MDM2" retrieved 39 citations. A quick scan shows
that many of the articles also refer to the p53
tumour suppressor gene, and a careful analysis showed that the
MDM2 gene inhibits p53 and apoptosis by binding to
it. But this is primarily in humans. To investigate
other functions of MDM2, specific to yeast, a
refined search adding "NOT p53" resulted in 8
articles, one of which talks about the role of MDM2
in fatty acid metabolism. Thus, careful searches and
sifting through the biological literature can help in
discovering all the possible functions known to be
associated with genes.
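The boolean refinement used in this scenario can be sketched on a toy, in-memory collection. The three records below are invented for illustration and are not actual PubMed citations; the function names are likewise assumptions of this sketch, which does not reproduce PubMed's own search engine.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case a record and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boolean_search(records: dict[int, str], required=(), excluded=()) -> list[int]:
    """Return ids of records containing all `required` and no `excluded` terms."""
    req = {t.lower() for t in required}
    exc = {t.lower() for t in excluded}
    hits = []
    for rec_id, text in records.items():
        words = tokenize(text)
        if req <= words and not (exc & words):
            hits.append(rec_id)
    return sorted(hits)


# Hypothetical records standing in for citation titles.
records = {
    1: "MDM2 binds p53 and blocks apoptosis: a yeast two-hybrid study",
    2: "Yeast MDM2 mutants are defective in fatty acid metabolism",
    3: "Human p53 signalling pathways",
}

print(boolean_search(records, required=["yeast", "MDM2"]))                    # [1, 2]
print(boolean_search(records, required=["yeast", "MDM2"], excluded=["p53"]))  # [2]
```

Exact word matching of this kind is what makes such searches sensitive to spelling and naming variants.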
The quality of results of a text-based search
depends on the quality of the contents of the database.
Hence, both when conducting a text-based search and,
even more importantly, when designing biological
databases and enabling text searches in newly created
databases, it is important to keep the following in
mind:
If the database contains free-form text, spelling
errors can exclude relevant entries. Inconsistencies may also result in relevant entries
being excluded (for example, IL-2 and IL2 are
both used to refer to Interleukin-2).
Problems with free-form text can be avoided
by the use of keywords. However, author-supplied keywords can also be arbitrary and
inconsistent.
The best solution is the use of a controlled
vocabulary. For example, PubMed uses an
extensive controlled vocabulary called MeSH
(medical subject headings). However, it is
important to understand the organization and
hierarchy of such a controlled vocabulary if
used in searching.
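The idea of a controlled vocabulary can be illustrated by mapping free-form variants onto one preferred term before indexing or searching. The vocabulary below is a hypothetical one-entry example built around the IL-2/IL2 case mentioned above; it is not an excerpt from MeSH, and the punctuation-stripping rule is an assumption that could wrongly merge distinct terms in a real database.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> accepted variants.
CONTROLLED_VOCABULARY = {
    "Interleukin-2": ["IL-2", "IL2", "interleukin 2"],
}


def _key(term: str) -> str:
    """Comparison key: lower-case with spaces and punctuation removed."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


# Reverse lookup from every known spelling to its preferred term.
_LOOKUP = {}
for preferred, variants in CONTROLLED_VOCABULARY.items():
    for name in [preferred, *variants]:
        _LOOKUP[_key(name)] = preferred


def normalize(term: str) -> str:
    """Return the preferred term, or the input unchanged if it is unknown."""
    return _LOOKUP.get(_key(term), term)


print(normalize("IL-2"), normalize("IL2"), normalize("BRCA1"))
```

Applying `normalize` to both the stored entries and the user's query means that IL-2 and IL2 retrieve the same records.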
Similarity-based Database Searching: BLAST and FASTA
With large-scale genome sequencing projects, the
flood of DNA sequence data coming into public
databases is staggering. Researchers are increasingly
relying on inferring the function of putative genes
through similarity to well-characterized proteins. It is
important to realize that an in-silico
sequence similarity search needs to be designed as carefully
as a wet lab experiment in order to get
biologically meaningful results. In this paper, some of
the issues in using the most popular sequence
similarity search tools are reviewed (Durbin et al,
1999; Gusfield, 1997).
Sequence similarity searches use alignments to
determine a "match". An alignment of two sequences is
a matching of the two sequences that allows for
the most common mutations: insertions, deletions,
and single-character substitutions. The basic
considerations in using a sequence-similarity search
are:
Global vs. Local Alignment. Global alignment
forces complete alignment of the input sequences,
whereas local alignment will align the most similar
segments. The choice of global vs. local depends on
the assumptions made by the user as to whether the
sequences are related over their entire length or
presumed to share only isolated regions of homology.
As similarities often span segments rather than entire
sequences, local alignment is the most popular
choice for database similarity searches. See Fig. 4 for a sample
output of global alignment (a) and local alignment (b).
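The difference between the two modes can be seen in a small dynamic-programming sketch. It assumes unit match/mismatch scores and a linear gap penalty, chosen here purely for illustration (real searches use substitution matrices and separate gap opening and extension costs), and it returns only the optimal score rather than the alignment itself.

```python
def align_score(a: str, b: str, local: bool = False,
                match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global (default) or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


# Two sequences that share only the short segment ACG.
print(align_score("TTTACGTTT", "GGGACGGGG"))              # -3 (global)
print(align_score("TTTACGTTT", "GGGACGGGG", local=True))  # 3 (local)
```

The global score is dragged down by the unrelated flanking regions, whereas the local score reflects only the shared segment.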
Alignment Algorithms. There are a variety of
relatively efficient alignment algorithms, each of
which aims to determine an optimal alignment.
The first of these to be described in the biological
literature was the Needleman-Wunsch algorithm
(Needleman & Wunsch, 1970) for optimal global
alignment, followed by a slight variant, the Smith-Waterman
algorithm (Smith & Waterman, 1981), for
optimal local alignment of two sequences. These two
methods were developed prior to wholesale genome
sequencing. Today, the special purpose parallel
machines and the massive computation time required
by these algorithms have rendered them almost
obsolete for database searching. Most users prefer BLAST (Altschul
et al, 1990) (http://www.ncbi.nlm.nih.gov/BLAST)
or FASTA (Pearson & Lipman, 1988)
(http://www.ebi.ac.uk/fasta33), which rely on
heuristic strategies to speed up alignment searches.
Promising regions are first determined through rapid
exact match searches, and only then is Smith-Waterman
invoked. This approach permits FASTA or
BLAST to run 10 to 100 times faster than
conventional Smith-Waterman, at the cost of missing
a few alignments. Some of the adjustable parameters
described below provide the user the flexibility to
trade off between speed and accuracy. BLAST, in
general, tends to be faster and more sensitive
(detects more alignments), but FASTA returns fewer
false hits.
Fig. 4-Global (a) and local (b) alignments [(a) P00001 and P00090; (b) P13569 and P33593].
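The heuristic of locating promising regions through rapid exact match searches can be sketched as follows. This is a simplified illustration that looks up exact words of length `k` and extends them without gaps; it does not reproduce the actual BLAST or FASTA implementations, and the function names and the choice `k=4` are assumptions made for this example.

```python
from collections import defaultdict


def build_index(subject: str, k: int) -> dict[str, list[int]]:
    """Map every word of length k in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 4) -> list[tuple[int, int]]:
    """Exact word matches as (query position, subject position) pairs."""
    index = build_index(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query: str, subject: str, q: int, s: int, k: int) -> tuple[int, int, int]:
    """Extend a seed in both directions, without gaps, while characters match."""
    left = 0
    while q - left > 0 and s - left > 0 and query[q - left - 1] == subject[s - left - 1]:
        left += 1
    right = 0
    while (q + k + right < len(query) and s + k + right < len(subject)
           and query[q + k + right] == subject[s + k + right]):
        right += 1
    return q - left, s - left, k + left + right


query, subject = "TTACGTAGG", "CCCACGTAGCC"
print(find_seeds(query, subject))             # [(2, 3), (3, 4), (4, 5)]
print(extend_seed(query, subject, 2, 3, 4))   # (2, 3, 6): the shared segment ACGTAG
```

Only the regions around such seeds would then be passed to the more expensive Smith-Waterman step, which is where the speed-up comes from.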
Search Parameters. The effectiveness of alignment
algorithms depends on their parameters: a careful choice
is necessary, without which important alignments may
be overlooked or too many spurious alignments may
be returned. There are three sets of parameters that
can be specified by the user to control the results: the
alignment parameters, algorithmic parameters, and
output parameters.
=> The alignment parameters include the choice of
substitution matrix and the costs associated
with gaps. The substitution matrix gives the cost
associated with substituting one residue with
another in aligning two protein sequences. The
most popular substitution matrices are the
PAM (Schwartz & Dayhoff, 1978) and
BLOSUM (Henikoff & Henikoff, 1992) families
of matrices. The gap cost parameters involve a
cost associated with opening a gap and a lesser
cost associated with extension of a gap.
=> The different algorithmic parameters mostly
control the heuristics on which BLAST and
FASTA rely and hence allow the user to
control the speed and accuracy of their
alignments. We refer the reader to the online
manuals available at the BLAST and FASTA
sites referred to above for a detailed description
of these algorithmic parameters.
=> The output parameters include a threshold for
the E-score and the desired number of matches.
The E-score is a measure of the statistical
significance of an alignment; it
combines the raw score from the alignment,
the length of the query sequence, and the size
of the database. The E-score gives the expected
number of sequences in the database that
would align with the given raw score by
chance. Typically, in a database the size of
Genbank or Swiss-Prot, one expects random
matches of 5 to 10 sequences, and thus
alignments with E-scores greater than 5 (or 10) are
ignored. However, in smaller databases, such
as the PDB, it is important to use a smaller
E-score threshold.
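How the E-score combines raw score, query length and database size can be sketched with the Karlin-Altschul form commonly given in the BLAST documentation, E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. The default values of `lam` and `k` below are illustrative assumptions; in practice they depend on the substitution matrix and gap costs and are reported by the search program.

```python
import math


def expected_hits(raw_score: float, query_len: int, db_len: int,
                  lam: float = 0.267, k: float = 0.041) -> float:
    """Expected number of chance alignments scoring at least raw_score.

    lam and k are assumed, illustrative statistical parameters.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# The same raw score is less significant in a larger database.
small_db = expected_hits(raw_score=60, query_len=300, db_len=10**6)
large_db = expected_hits(raw_score=60, query_len=300, db_len=10**9)
print(small_db < large_db)  # True

# A higher raw score lowers the expected number of chance hits.
print(expected_hits(80, 300, 10**9) < expected_hits(60, 300, 10**9))  # True
```

Because the expectation scales linearly with database size, a fixed raw score yields a thousand-fold larger E-score when the database is a thousand times larger.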
Various forms of BLAST and FASTA are used in
alignment of different types of biological sequences,
some of which are listed in Table 2.
Table 2-BLAST and FASTA variants for different searches
Program | Input Sequence | Comparison Database | Common Use
BLASTn/FASTA | Nucleotide | Nucleotide | Align a new DNA sequence to a nucleotide sequence database.
BLASTp/FASTA | Protein | Protein | Seeks to align an amino acid query sequence to a protein sequence database.
BLASTx/FASTx | Nucleotide (translated) | Protein | Analyze new DNA sequence (translated) to find potential coding regions.
TBLASTn/tFASTx | Protein | Nucleotide (translated) | Useful for EST analysis.
TBLASTx/tFASTx | Nucleotide (translated) | Nucleotide (translated) | Useful for EST analysis.
Pairwise sequence alignment using BLAST and
FASTA has been extended in two ways: (1) for
multiple alignment of DNA or protein sequences and
(2) structural alignment for determination of protein
structural neighbours, where the extension of BLAST
to 3-dimensional coordinates is called VAST (Gibrat
et al, 1996). In multiple alignments, the most common
method, called Clustal W (Gibson et al, 1994), creates
a multiple alignment of DNA or protein sequences,
starting with BLAST or FASTA pairwise alignment
scores. A more detailed discussion on these topics is
beyond the scope of this paper and we refer the
readers to Durbin et al, 1999.
Protein Structure Visualization: RasMol and Kinemage
Most protein structure databases today come
equipped with visualization tools, the most frequently
used being the freely available RasMol (Sayle
& Milner-White, 1995). The name "RasMol" is
derived from Raster (the array of pixels on a
computer screen) and Molecules. The fact that the
initials of the author of RasMol are R.A.S. is probably
only coincidental (Fig. 5).
RasMol. This is a molecular graphics software
that allows visualization of proteins, nucleic
acids, and small molecules for which a 3-dimensional
structure is available. In order to
display a molecule, RasMol requires an atomic
coordinate file that specifies the position of
every atom in the molecule through its 3-dimensional
cartesian coordinates. RasMol
accepts this coordinate file in a variety of
formats, including the Protein Data Bank
(PDB) format. The visualization provides the
user a choice of colour schemes and molecular
representations [wireframe, cylinder (Dreiding)
stick bonds, alpha-carbon trace, space filling
(CPK) spheres, macro-molecular ribbons
(either smooth shaded solid ribbons or parallel
strands), hydrogen bonding and dot surface].
Additional features such as text labeling for
selected atoms, different colour schemes for
different parts of the molecule, zoom, rotation,
etc. have made this the most popular of all
visualization tools.
Chime and Protein Explorer are derivatives of
RasMol that allow visualization inside web
browsers, while RasMol runs independently
outside a web browser.
Kinemage. One of the drawbacks of RasMol
was that it failed to allow the user to move two
molecules, or parts of a molecular complex,
relative to each other. For example, RasMol
cannot show the binding of a substrate to, or its
release from, an enzyme. This drawback was
corrected when kinemage (kinetic images) was
developed (Richardson & Richardson, 1989).
To quote the authors, "Kinemages are set up to
illustrate a particular idea about a three-dimensional
object, rather than neutrally
displaying that object; they incorporate the
author's selection, emphasis, and viewpoint..."
Fig. 5-(a) RasMol [Phage CRO Repressor on DNA. Andrew
Coulson & Roger Sayle with RasMol, University of Edinburgh,
1993] and (b) Kinemage images of protein structure
4. Integrated Molecular Biology Database Systems
A database is a repository that provides a
centralized and homogeneous view of its contents.
The repository is created and modified through a
database management system (DBMS). Every data
item in the database is structured according to a
schema, defined as a set of pre-specified rules through
the data definition language. The contents of the
database can typically be accessed through a
graphical user interface (GUI) that allows browsing
through the contents of the repository, much
as one may browse through the books in a
library. Most databases also allow querying of their
contents through a specialized query language. The
data definition language and the query language form
the data model and define the semantics of the
manipulations and operations allowed on the
database.
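The roles of the data definition language and the query language can be illustrated with a minimal sketch. It assumes a relational DBMS (SQLite, through Python's standard sqlite3 module); the table name, column names and records are hypothetical and are not taken from any database discussed in this paper.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE sequence_entry (
    accession TEXT PRIMARY KEY,
    organism  TEXT NOT NULL,
    length    INTEGER NOT NULL CHECK (length > 0)
);
"""

def build_database():
    """Create an in-memory database and load a few made-up entries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)  # data definition language: defines the schema
    conn.executemany(
        "INSERT INTO sequence_entry VALUES (?, ?, ?)",
        [("X00001", "Arabidopsis thaliana", 1200),
         ("X00002", "Mus musculus", 850),
         ("X00003", "Arabidopsis thaliana", 430)],
    )
    return conn

def accessions_for(conn, organism):
    """Query language: retrieve the accessions for one organism, sorted."""
    rows = conn.execute(
        "SELECT accession FROM sequence_entry "
        "WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return [r[0] for r in rows]

conn = build_database()
print(accessions_for(conn, "Arabidopsis thaliana"))
```

The schema both describes the data and constrains it: an entry that breaks a pre-specified rule (here, a non-positive length) is rejected by the DBMS rather than stored.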
For example, the schemas of the Genome Sequence
Data Base (GSDB) and the Mouse Genome Database
(MGD) are defined using the data definition language
of the Sybase relational DBMS, the structure of the
Arabidopsis thaliana database (AtDB) (as well as
numerous other genomic databases) is defined using
AceDB (Durbin and Thierry-Mieg, 1992), and the
structures of the Genome Database (GDB) and the Protein
Data Bank (PDB) are defined using an object protocol
model (OPM) (Chen & Markowitz, 1995) that allows
storage of images on top of a relational database
management system. For molecular biology
databanks maintained as files, the data definition
languages used for defining their structure are not
based on a data model per se and range from generic
notations such as the ASN.1 data exchange format
used for Genbank to ad-hoc data definition languages
such as that employed for EMBL.
Comprehensive studies of molecular
biology data involve exploring multiple databases.
Rather than requiring the user to combine information
retrieved from multiple databases, it is preferable
to integrate the databases themselves. The
particular challenges of integrating biological data
sources (as compared to heterogeneous data sources
from other domains) have been discussed (Davidson
et al, 1995; Markowitz & Ritter, 1995; Karp, 1995).
The main hurdle in the integration of
multiple biological databases is their inherent
heterogeneity, which is
caused by:
1 Heterogeneity of Content: Different databases
are used to store a variety of information; for
example, protein sequence information is
available through Swiss-Prot while protein
structure information is available through the
Protein Data Bank (PDB).
2 Heterogeneity of Database Management
System: Different data types require different
database management systems. For example,
table-structured data can be easily stored
in relational database management
systems (such as those of Oracle), text data
as in Genbank is not amenable to storage in
relational DBMSs, while images require object
database management systems such as that of
OPM (Chen & Markowitz, 1995).
3 Heterogeneity of Data Model: The data model
(the schema and the query language) of
heterogeneous systems will also vary. For
example, relational systems mostly use SQL
(structured query language), while special
query languages such as OQL (object query
language) need to be used for object databases
like the PDB.
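A small sketch may make the heterogeneity of DBMS and data model concrete. It asks the same question ("what is the sequence for this accession?") of a flat-file text record and of a relational table. The record below is made up and only loosely follows a GenBank-like layout; the parser, table name and column names are assumptions for illustration.

```python
import sqlite3

# A made-up flat-file record in a GenBank-like layout (illustrative only).
FLAT_RECORD = """\
LOCUS       DEMO0001     12 bp    DNA
DEFINITION  Hypothetical demo sequence.
ACCESSION   DEMO0001
ORIGIN
        1 atggcgtacg ta
//
"""

def parse_flat_record(text):
    """Pull the accession and sequence out of the text record by hand."""
    accession, seq_parts, in_origin = None, [], False
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Drop the leading position number, keep the sequence blocks.
            seq_parts.extend(line.split()[1:])
    return {"accession": accession, "sequence": "".join(seq_parts)}

def query_relational(accession):
    """The same question asked of a relational source, in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seq (accession TEXT PRIMARY KEY, sequence TEXT)")
    conn.execute("INSERT INTO seq VALUES ('DEMO0001', 'atggcgtacgta')")
    row = conn.execute(
        "SELECT accession, sequence FROM seq WHERE accession = ?", (accession,)
    ).fetchone()
    return {"accession": row[0], "sequence": row[1]}

print(parse_flat_record(FLAT_RECORD) == query_relational("DEMO0001"))
```

The two sources hold the same information, yet one needs format-specific parsing code while the other answers a declarative query. An integrated system has to hide this difference from the user.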
In integrating molecular biology databases, the
following issues need to be addressed:
Integration of Data
==> A basic problem underlying the integration of
heterogeneous databases is the autonomy of the
sources, which has led to lack of cooperation
and non-standardization of formats with some
notable exceptions. For example, Genbank,
EMBL and the DNA Data Bank of Japan
(DDBJ) cooperate in creating a centralized
repository for the human genome sequence
data and daily exchange of data is made for the
purpose of synchronization. A cooperation/co-licensing
agreement is the first step in the
creation of integrated systems.
==> As data is exchanged between heterogeneous
systems, schema converters are required
(which convert data from one schema to either
the schema of another database or a global
schema). General schema integration
methodologies have been discussed (Batini et
al, 1986) and further evaluated in the context
of biological databases by Buneman et al
(1995).
==> Data conflicts and errors need to be resolved in
a systematic manner.
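The last two points can be sketched in a few lines. The example below is only an illustration under stated assumptions: the two source schemas, the global schema and all field names are hypothetical, and a real schema converter would also have to handle differences in types, units and structure.

```python
# Hypothetical field names for two sources and a global schema (illustrative only).
FIELD_MAPS = {
    "source_a": {"acc": "accession", "org": "organism", "len": "length"},
    "source_b": {"id": "accession", "species": "organism", "seq_length": "length"},
}

def to_global(source, record):
    """Rename a record's fields from a source schema into the global schema."""
    mapping = FIELD_MAPS[source]
    unknown = set(record) - set(mapping)
    if unknown:
        raise ValueError(f"fields not covered by the mapping: {sorted(unknown)}")
    return {mapping[field]: value for field, value in record.items()}

def find_conflicts(rec_a, rec_b):
    """List the global fields on which two converted records disagree."""
    shared = set(rec_a) & set(rec_b)
    return sorted(f for f in shared if rec_a[f] != rec_b[f])

a = to_global("source_a", {"acc": "X00001", "org": "Mus musculus", "len": 850})
b = to_global("source_b", {"id": "X00001", "species": "Mus musculus", "seq_length": 851})
print(find_conflicts(a, b))
```

Once both records are expressed in the global schema, a conflict becomes visible as a simple field-by-field disagreement, which can then be resolved by a stated rule (for example, preferring the source of record).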
Integration of User Interfaces
==> Browsing Interface: A unified view
of the data is presented in one of two possible ways
(Markowitz & Ritter, 1995): (a) a global
schema is created by unifying the schemas of
the component databases; or (b) local views of
the data are provided, which use the "local"
schema of each component database.
==> Query Interface: Each of the component
databases may support different types of
queries (e.g. free-text search, keyword search,
accession number search, etc.). Integrated
querying of multiple databases is still the holy
grail of heterogeneous database management
systems today. The main hurdles to integrated
querying remain: (1) rewriting the queries and
query rules of the integrated system into
operations on the source databases; (2) the ability
to implement joins across heterogeneous items
from the source databases; (3) the ability to
identify redundancy; and (4) the ability to
exploit the parallelism of the different servers
(computers) in use for the source databases.
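The four hurdles can be seen even in a toy setting. The sketch below assumes two made-up in-memory "sources" with different native lookups; the data, function names and the use of a thread pool are illustrative assumptions, not a description of any existing system.

```python
from concurrent.futures import ThreadPoolExecutor

# Two made-up in-memory "sources" with different native query styles.
SEQUENCE_SOURCE = [  # supports lookup by organism
    {"accession": "P1", "organism": "Mus musculus"},
    {"accession": "P2", "organism": "Homo sapiens"},
    {"accession": "P1", "organism": "Mus musculus"},  # redundant entry
]
STRUCTURE_SOURCE = {  # supports lookup by accession only
    "P1": ["1ABC"],
    "P2": ["2XYZ", "3DEF"],
}

def query_sequences(organism):
    """(1) Rewrite of the global query for the sequence source."""
    return [e["accession"] for e in SEQUENCE_SOURCE if e["organism"] == organism]

def query_structures(accession):
    """(1) Rewrite of the global query for the structure source."""
    return accession, STRUCTURE_SOURCE.get(accession, [])

def structures_for_organism(organism):
    """Global query: which structures exist for sequences of an organism?"""
    # (3) Identify redundancy: drop duplicate accessions, keeping order.
    accessions = list(dict.fromkeys(query_sequences(organism)))
    # (4) Exploit parallelism: query the second source concurrently.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(query_structures, accessions))
    # (2) Join the two sources on the shared accession.
    return {acc: ids for acc, ids in results}

print(structures_for_organism("Mus musculus"))
```

In practice each step is far harder than shown: the sources rarely share a clean join key, and redundant entries are seldom exact duplicates.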
Integration of Visualization and Analytical Tools

==> With integration comes the advantage that
bioinformatics applications (analytical and
visualization tools) can be developed on a
common data model.
Three types of integration strategy are currently
used for molecular biology databases:
data warehousing, link-driven federation, and
semantic integration. These strategies, along with
the integrated molecular biology database systems
that have resulted from them,
are reviewed below.
Data Warehousing
Data Warehouses represent the materialization of a
global schema, i.e. the integrated database is defined
by the global schema and loaded with data from the
component databases. The steps involved in creating a
data warehouse are:
1 Downloading of data from the component
databases
2 Data cleaning (removal of erroneous entries and
resolving data conflicts)
3 Reformatting the data into the global schema
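The three steps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the "downloads" are made-up in-memory records, the global schema is a single hypothetical table, and the conflict rule (the first source to supply an accession wins) is chosen only for the example.

```python
import sqlite3

# Stand-ins for data downloaded from two component databases (made-up records).
DOWNLOADS = {
    "source_a": [
        {"acc": "X00001", "len": "850"},
        {"acc": "X00002", "len": "-5"},   # erroneous entry: impossible length
    ],
    "source_b": [
        {"acc": "X00001", "len": "850"},  # same entry held by a second source
        {"acc": "X00003", "len": "430"},
    ],
}

def clean(records):
    """Step 2: drop entries whose length is not a positive integer."""
    return [r for r in records if r["len"].isdigit() and int(r["len"]) > 0]

def load_warehouse(downloads):
    """Steps 1-3: take downloaded data, clean it, and load the global schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (accession TEXT PRIMARY KEY, length INTEGER, source TEXT)"
    )
    for source, records in downloads.items():
        for r in clean(records):
            # Step 3: reformat into the global schema. Conflict rule (an
            # assumption): the first source to supply an accession wins.
            conn.execute(
                "INSERT OR IGNORE INTO entry VALUES (?, ?, ?)",
                (r["acc"], int(r["len"]), source),
            )
    return conn

conn = load_warehouse(DOWNLOADS)
print(conn.execute("SELECT accession, source FROM entry ORDER BY accession").fetchall())
```

Because the warehouse holds its own copy of the data, the component databases are left untouched and queries run locally.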
A data warehouse is often confused with a
consolidation of multiple databases. In consolidating
multiple databases, the component databases are
subsumed into a larger database and the individual
component databases are discarded, whereas in data
warehousing, the component databases are not
disturbed. Consolidation is far more complex and
expensive, requiring consensus on common names,
data structures, and policies. Furthermore, existing
applications on component databases must be
converted in order to function on the consolidated
system. The relative advantages and disadvantages of
data warehousing systems are listed in Table 3.
Table 3-Data Warehousing

Advantages
==> Downloaded source data can be manipulated into
suitable formats
==> Global schema allows browsing of data through a
unified view
==> Execution of queries is usually very fast because
all data is locally available
==> System is reliable because there is no outside
dependency.

Disadvantages
==> High maintenance cost as data needs to be
constantly synchronized with component databases
==> Large initial costs associated with setup and
schema development
==> Storage requirements add to cost
==> System does not scale easily - not easy to
add new databases.

A data warehousing system currently used for molecular
biology databases is described below.
GUS. The Genomics Unified Schema (GUS)
(Davidson et al, 2001) is a
warehousing-based data integration system from the
University of Pennsylvania. GUS uses a relational
data model and stores nucleotide and amino acid
sequences and their annotations in tables. The data
sources already included are GenBank/EMBL/DDBJ, dbEST
and SWISS-PROT.
GUS builds and maintains a map between DNA
sequence-based entries at some sites and gene-based
entries at others through its local storage of the
necessary data. Its tables hold the conceptual entities
that DNA sequences and annotations indirectly
represent: genes, the RNA derived from
these genes, and the proteins derived from those RNAs.
While transforming the data into a gene-centric
organization, it cleans the data to identify erroneous
annotations and misidentified sequences.
GUS facilitates data maintenance by tracking how
data is generated or accessed from sources and
subsequently modified. This helps in following the
continuous changes (external or internal) to the
knowledge of genes that the data items represent:
original data may be revised by the source itself,
annotations are gradually verified experimentally, and
predicted values become more accurate with better
algorithms. For example, with computationally
derived annotations, GUS stores the algorithm used,
its implementation version, input parameters and the
run time information.
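One way to picture this kind of tracking is a provenance record attached to each computed annotation. The sketch below is not the GUS schema; the class names, field names and the example predictor are hypothetical and serve only to show the idea of keeping the algorithm, version, parameters and run time alongside the value, together with the superseded values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Provenance:
    """How a computed annotation was produced (field names are illustrative)."""
    algorithm: str
    version: str
    parameters: tuple   # sorted (name, value) pairs
    run_at: datetime

@dataclass
class Annotation:
    entry_id: str
    value: str
    provenance: Provenance
    history: list = field(default_factory=list)  # earlier (value, provenance) pairs

    def revise(self, value, provenance):
        """Replace the value but keep the superseded one for tracking."""
        self.history.append((self.value, self.provenance))
        self.value, self.provenance = value, provenance

def make_provenance(algorithm, version, **params):
    """Build a provenance record stamped with the current UTC time."""
    return Provenance(algorithm, version, tuple(sorted(params.items())),
                      datetime.now(timezone.utc))

ann = Annotation("X00001", "putative kinase",
                 make_provenance("demo-predictor", "1.0", threshold=0.5))
ann.revise("kinase", make_provenance("demo-predictor", "2.0", threshold=0.7))
print(ann.value, len(ann.history), ann.history[0][1].version)
```

Keeping the history makes it possible to ask later which annotations were produced by an outdated algorithm version and should be recomputed.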
To keep its database synchronized with external
data sources after the initial download, GUS retrieves
updates and new entries from them, either on the
source's change schedule or periodically. The
changed fields of the modified entries are detected
by a difference operation against the database
and updated accordingly. Both the new and updated
entries are subjected to an annotation update process
in which the protein and DNA sequences and their
annotations are transformed into gene- and protein-based
entries. The user always sees a production
version of the database while updates are made to the
next development version. When the development
cycle is over, the development version is put into
production, and a new development version is created.
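The difference operation and the production/development cycle can be sketched as follows. This is an illustration only: entries are modelled as plain dictionaries with made-up fields, and the way GUS actually stores and compares entries is not described here.

```python
def diff_entry(stored, incoming):
    """Return {field: (old, new)} for fields that differ between two versions."""
    fields = set(stored) | set(incoming)
    return {
        f: (stored.get(f), incoming.get(f))
        for f in sorted(fields)
        if stored.get(f) != incoming.get(f)
    }

def synchronize(development, updates):
    """Apply source updates to the development copy; report what changed."""
    report = {}
    for entry_id, incoming in updates.items():
        if entry_id not in development:
            development[entry_id] = dict(incoming)
            report[entry_id] = "new"
        else:
            changes = diff_entry(development[entry_id], incoming)
            if changes:
                development[entry_id] = dict(incoming)
                report[entry_id] = changes
    return report

production = {"X00001": {"length": 850, "note": "hypothetical protein"}}
development = {k: dict(v) for k, v in production.items()}  # next version
report = synchronize(development, {
    "X00001": {"length": 850, "note": "kinase"},
    "X00002": {"length": 430, "note": "unknown"},
})
print(report)
production = development  # end of cycle: development goes into production
```

Users querying the production copy never see a half-applied update, because all changes accumulate in the development copy until the switch.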
Link Driven Federation
The link driven federation approach has been used
successfully, mainly by online molecular biology
databases, to add value to their data. A link connects
an object in one database to objects in possibly
another database. Links can also be to objects in the
same database (for example, the related documents
links that appear with entries in the PubMed
database). The links provided allow the user to start
from a data item of interest in a particular database
and then jump to other related data sources through
the links. The user still has to interact with individual
sources, but the interaction is made easier by
convenient links, and invoking or querying the individual
source databases directly is not required. The most
widely used integrated molecular biology systems,
Entrez, SRS (Etzold & Argos, 1993), and LinkDB
(Fujibuchi et al, 1998), are examples of this approach
(Table 4).

Table 4-Link-Driven Federation

Advantages
==> Point-and-click links make it convenient for the
user to see related sources
==> Links to a variety of information for each entry
in a database

Disadvantages
==> Manual link creation difficult for large
databases
==> Does not scale easily: for each entry in a
database, it is difficult to generate all links to a
new database
==> Changes in source database schema may
result in links becoming obsolete

Entrez. The National Center for Biotechnology
Information (NCBI), which is part of the National
Institutes of Health, USA, is the foremost repository of
publicly available genomic and proteomic data. Its
integrated information retrieval system, known as
Entrez, is perhaps the most utilized of all biological
database systems. Entrez uses a link-based approach
to cross-reference entries from different databases:
the nucleotide sequence database Genbank, the
medical literature database PubMed, NCBI's protein
sequence database, NCBI's protein structure database
and NCBI's database of whole genomes (Fig. 6).
Hard links are applied between the different
databases whenever there is a logical connection
between entries. LinkOut from PubMed citations also
provides links to user-defined external web pages (for
example, full-text journal articles, biological data,
sequence centres, etc.). These external resources
provide a URL, a resource name and a brief description
of their web site, which PubMed uses to create the
links to their sites. For a complete review of the
features and complexities of Entrez, readers may
refer to a tutorial on the Entrez system at
http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/helpdoc.html.
Fig. 6-Entrez map
SRS. The creators of the Swiss-Prot database at the
Swiss Institute of Bioinformatics and the European
Bioinformatics Institute have created SRS (Sequence
Retrieval System). SRS allows retrieval from an
extensive catalogue of more than 75 public biological
databases of interest. The link button in SRS allows
the user to obtain all the entries in one databank that
are linked to an entry (or entries) in another databank.
Hyperlinks are links between entries, which are
displayed as hypertext (clicking on the text takes you
to the related entry). These are hard coded into SRS
and are useful for examining entries that are
referenced directly from a data item of interest.
To see what data is linked to an entry, the user ticks the
checkbox next to that entry and clicks the link
button, which displays the LINK page (Fig. 7). After
selecting the database to be linked with the
entry (say Swiss-Prot), the user clicks the submit link
button. The result will be a list of all the SWISS-PROT
entries that are related to the EMBL entries
with which we started. These will be displayed on the
Query Result page.
Fig. 7-SRS link page
LinkDB. The integrated database retrieval system
DBGET/LinkDB (Fujibuchi et al, 1998) is the
backbone of the Japanese GenomeNet service.
DBGET is used to search and extract entries from a
wide range of molecular biology databases (Fig. 8),
while LinkDB is used to search and compute links
between entries in different databases. Once an entry
is retrieved through DBGET, all links from this entry
can be obtained by clicking on the entry name, which
triggers a search against LinkDB. In addition to the
original links provided by the source database, which
are embedded in the entry, LinkDB also aims at
providing computer-generated links, which include:
==> Factual Links: links between database entries,
e.g., Medline ID and GenBank accession
==> Similarity Links: links produced by similarity
search, e.g., the results of BLAST and FASTA
==> Biological Links: links by biological meanings,
e.g., molecular or genetic interactions in the
KEGG pathways.
Fig. 8-DBGET database links
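The link-driven approach can be pictured as a table of typed links that is searched from any starting entry. The sketch below is a minimal illustration: the entry identifiers, the "database:id" naming convention and the link data are made up, and it does not reproduce how Entrez, SRS or LinkDB store their links.

```python
from collections import defaultdict

# Made-up entries and links; the identifiers are illustrative only.
LINKS = [
    # (from_entry, to_entry, link_type)
    ("genbank:X00001", "medline:100", "factual"),
    ("genbank:X00001", "swissprot:P1", "factual"),
    ("swissprot:P1", "swissprot:P2", "similarity"),
    ("swissprot:P1", "pathway:map01", "biological"),
]

def build_index(links):
    """Index links in both directions so either end can be the start point."""
    index = defaultdict(list)
    for src, dst, kind in links:
        index[src].append((dst, kind))
        index[dst].append((src, kind))
    return index

def linked_entries(index, entry, database=None, kind=None):
    """Entries one link away, optionally filtered by target database or type."""
    return sorted(
        target for target, k in index.get(entry, [])
        if (database is None or target.startswith(database + ":"))
        and (kind is None or k == kind)
    )

index = build_index(LINKS)
print(linked_entries(index, "genbank:X00001"))
print(linked_entries(index, "swissprot:P1", kind="similarity"))
```

The user still moves from entry to entry, but no source database has to be queried directly: following a link is a lookup in the link table.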
Semantic Integration
The ultimate aim of the LinkDB system is to add
to its current hyperlinks using biological meanings
and relations. For example, known protein-protein
interactions should be represented through a bidirectional
hyperlink between the two protein
sequence entries. However, this kind of semantic
integration is still in its infancy and is the focus of much
bioinformatics research. Efforts on semantic
integration worth mentioning are the development of
XML ontologies and some early work on TAMBIS.
XML Ontologies. In a discipline-wide effort to
standardize the representation of entries in biological
databases, various organizations and institutions are
participating in the creation of XML ontologies.
==> XML: More flexible than HTML (the language used
in the creation of web pages), XML (eXtensible
Markup Language) has emerged in the last few
years as the de facto format for exchanging data,
with abundant development tools and
backing from major vendors. XML is ideal for
format integration because it can handle semi-structured
data (unlike the rigid data structures
of relational database systems). The wide use
of XML is easily seen in the variety of
standards being proposed, which are shown in
Table 5.
==> Ontology: An ontology is a set of concepts
(objects, events, and relations) that are
specified in order to create an agreed-upon
vocabulary for exchanging information.
Specification of an ontology involves (1)
determination of which concepts are to be
included in the ontology; (2) assigning English
language meaning to the terms; and (3)
defining all possible relations between the
concepts in the ontology.
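A short sketch shows how an XML record and an ontology fit together. The record, its element and attribute names, and the toy ontology are all made up for illustration and are not taken from any published standard; parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# A made-up XML record; element and attribute names are illustrative only.
RECORD = """
<entry accession="X00001">
  <organism>Arabidopsis thaliana</organism>
  <feature type="gene" name="demoA"/>
  <feature type="protein" name="DemoA"/>
  <note>Optional free text that other entries may omit.</note>
</entry>
"""

# A toy ontology: agreed-upon terms, their meanings, and allowed relations.
ONTOLOGY = {
    "terms": {"gene": "a heritable unit of DNA", "protein": "a gene product"},
    "relations": {("gene", "encodes", "protein")},
}

def read_features(xml_text):
    """Parse the record and return (type, name) pairs for its features."""
    root = ET.fromstring(xml_text)
    return [(f.get("type"), f.get("name")) for f in root.findall("feature")]

def unknown_terms(features, ontology):
    """Feature types that fall outside the agreed vocabulary."""
    return sorted({t for t, _ in features if t not in ontology["terms"]})

features = read_features(RECORD)
print(features)
print(unknown_terms(features, ONTOLOGY))
```

The optional note element illustrates semi-structured data: an entry may carry it or omit it without breaking the format, while the ontology keeps the vocabulary of feature types under control.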
Table 5-XML Standards in Biology

XML Standard | Description
AnatML | A language for storing geometric information and documentation obtained as part of the musculoskeletal modelling project
Array XML (AXML) | For exchanging and storing data from microarray experiments
Bioinformatic Sequence Markup Language (BSML) | A public domain standard for encoding and display of DNA, RNA and protein sequence information
Biopolymer Markup Language (BIOML) | BIOML standard allows the full specification of all experimental information known about molecular entities composed of biopolymers (for example, proteins and genes)
Genome Annotation Markup Elements (GAME) | The goals of GAME, at least in the perspective of the bioxml community, are to provide an XML ontology and tools for annotating biosequence "annotation features"
CellML | For storage and exchange of computer-based biological models
Clinical Trial Data Model | FDA safety domain and metadata models with an XML ontology for clinical data
Gene Expression Markup Language (GEML) | An open-standard XML format for microarray and gene expression data
GeneOntology Markup Language | The Gene Ontology(tm) Consortium is attempting to produce a dynamic controlled vocabulary that can be applied to all eukaryotes, even as we gain more knowledge of gene and protein roles in cells
GeneX Gene Expression Markup Language (GeneXML) | Part of the GeneX project, a massively distributed gene expression database
Molecular Dynamics Markup Language (MoDL) | MoDL (pronounced "Model") is an XML application that allows chemical simulation data visualization over the Web
Systems Biology Markup Language (SBML) | Represent models of biological systems common in research on a number of topics including cell signaling pathways, metabolic pathways, biochemical reactions, etc.
Taxonomic Markup Language (TaxonomicML) | Taxonomic ML seeks to standardize (1) the description of the structure (topology) of a biological phylogeny; (2) the presentation of statistical metadata about the phylogeny and (3) the option of superimposing a Linnaean taxonomy
XML Description Language for Taxonomy (XDELTA) | XML file format derived from the DELTA (DEscription Language for TAxonomy) standard

TAMBIS. Transparent Access to Multiple
Bioinformatics Information Sources (TAMBIS) is an
integration system for molecular biology where a
domain-dependent ontology is used for information
retrieval. The ontology is central to the system and
plays a role in query formulation and execution
(Goble et al, 2001).
Discussion
The beginning of the new millennium coincided
with the dawn of a new era in biology. The coming
century is expected to see remarkable discoveries in
the field of biology, and biotechnology is set to
play a major role in revolutionizing medical
treatment and solving many of life's
mysteries. Our interaction with the environment will also
be guided by biotechnology. This review paper
is intended to provide only a flavour of the scope and
resources available today. We are still at the
beginning of using genomics and proteomics in
biotechnology. Integration of data (by the biologist, or
by the computer scientist in creating access to
heterogeneous systems) will be one more important
step in furthering biotechnology research. Current
research in bioinformatics, as well as software
development efforts in bioinformatics as they pertain to
molecular biology databases, is focused on
integration.
The key to biotechnology discoveries is locked in
the genomes of organisms and bioinformatics holds
the key to unlock this data for the next generation of
innovations.
References:
Altschul S F, Gish W, Miller W, Myers E W & Lipman D J, 1990.
Basic local alignment search tool. J Mol Biol, 215, 403-410.
Arabidopsis Genome Initiative, 2000. Analysis of the genome
sequence of the flowering plant Arabidopsis thaliana. Nature
(Lond), 408, 796-815.
Bankier A T, Weston K M & Barrel B G, 1987. Random cloning
and sequencing by the M13/dideoxynucleotide chain
termination method. Methods Enzymol, 155, 51-93.
Batini C, Lenzerini M & Navathe S, 1986. A comparative analysis
of methodologies for database schema integration. ACM
Comput Surv, 18, 323-364.
Buneman P, Davidson S, Hart K, Overton C & Wong L, 1995. A
data transformation system for biological data sources. in
Proceedings of the 21st International Conference on Very
Large Databases. Pp 158-169.
Baxevanis A D & Francis Ouellette B F (Eds), 1998.
Bioinformatics: A Practical Guide to the Analysis of Genes
and Proteins. John Wiley & Sons, New York.
Chen I A & Markowitz V M, 1995. An overview of the
object-protocol model (OPM) and OPM data management tools. Inf
Syst, 20, 393-418.
Davidson S, Overton C & Buneman P, 1995. Challenges in
integrating biological data sources. J Computational Biol, 2,
557-572.
Davidson S, Crabtree J, Brunk B, Schug J, Tannen V, Overton C
& Stoeckert C, 2001. K2/Kleisli and GUS: Experiments in
integrated access to genomic data sources. IBM Syst J, 40,
512-531.
Durbin R & Thierry-Mieg J, 1992. Syntactic definition for the
AceDB database manager. Available at
http://probe.nalusda.gov:8000/acedocs.
Durbin R, Krogh A, Mitchison G & Eddy S, 1999. Biological
Sequence Analysis: Probabilistic Models of Proteins and
Nucleic Acids. Cambridge University Press, Cambridge, UK.
Etzold T & Argos P, 1993. SRS: An indexing and retrieval tool
for flat file data libraries. Comput Appl Biosci, 9, 49-57.
Fujibuchi W, Goto S, Migimatsu H, Uchiyama I, Ogiwara A,
Akiyama Y & Kanehisa M, 1998. DBGET/LinkDB: an
integrated database retrieval system. in Pacific Symposium
on Biocomputing. Pp 683-694.
Gibas C & Jambeck P, 2001. Developing Bioinformatics
Computer Skills. O'Reilly & Associates, New York.
Gibrat J-F, Madej T & Bryant S H, 1996. Surprising similarities in
structure comparison. Curr Opinion Struct Biol, 6, 377-385.
Gibson T J, Thompson J D & Higgins D G, 1994. CLUSTAL W:
improving the sensitivity of progressive multiple sequence
alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res, 22,
4673-4680.
Goble C, Stevens R, Ng G, Bechhofer S, Paton N, Baker P, Peim
M & Brass A, 2001. Transparent access to multiple
Bioinformatics information sources. IBM Syst J, 40, 532-551.
Gura T, 2000. Reaping the plant gene harvest. Science, 287, 412-414.
Gusfield D, 1997. Algorithms on Strings, Trees, and Sequences:
Computer Science and Computational Biology. Cambridge
University Press, Cambridge, UK.
Henikoff S & Henikoff J G, 1992. Amino acid substitution
matrices from protein blocks. Proc Natl Acad Sci USA, 89,
10915-10919.
Karp P, 1995. A strategy for database interoperation. J
Computational Biol, 2, 573-583.
Markowitz V M & Ritter O, 1995. Characterizing heterogeneous
molecular biology database systems. J Computational Biol,
2, 547-556.
Melin A M, Perromat A & Deleris G, 2001. Sensitivity of
Deinococcus radiodurans to gamma-irradiation: A novel
approach by fourier transform infrared spectroscopy. Arch
Biochem Biophys, 394, 265-274.
Misener S & Krawetz S A, 2000. Bioinformatics: Methods and
Protocols. Humana Press, New Jersey.
Mount D, 2001. Bioinformatics: Sequence and Genome Analysis.
Cold Spring Harbor Laboratory Press, Cold
Spring Harbor, New York.
Needleman S B & Wunsch C D, 1970. A general method
applicable to the search for similarities in the amino acid
sequences of two proteins. J Mol Biol, 48, 443-453.
Pearson W R & Lipman D J, 1988. Improved tools for biological
sequence analysis. Proc Natl Acad Sci USA, 85, 2444-2448.
Rashidi H H & Buehler L K, 1999. Bioinformatics Basics:
Applications in Biological Science and Medicine. CRC
Press, Boca Raton, Florida.
Richardson J S & Richardson D C, 1989. Principles and patterns
of protein conformation. in Prediction of protein structure
and the principles of protein conformation, edited by G D
Fasman. Plenum Press, New York. Pp 1-98.
Sayle R A & Milner-White E J, 1995. RasMol: Biomolecular
graphics for all. Trends Biochem Sci, 20, 374-376.
Schwartz R M & Dayhoff M O, 1978. Matrices for detecting
distant relationships. in Atlas of Protein Sequence and
Structure, 5 (suppl. 3), 353-358.
Smith T F & Waterman M S, 1981. Identification of common
molecular sub-sequences. J Mol Biol, 147, 195-197.
Waterston R & Sulston J, 1995. The C. elegans genome
sequencing project. Proc Natl Acad Sci USA, 92, 10836-10840.
Weigel D, Ahn J H, Blazquez, Borevitz J O et al, 2000.
Activation tagging in Arabidopsis. Plant Physiol, 122, 1004-1013.
White O et al, 1999. Genome sequence of the radioresistant
bacterium Deinococcus radiodurans R1. Science, 286, 1571-1577.
Zuo J & Chua N H, 2000. Chemical-inducible systems for
regulated expression of plant genes. Curr Opin Biotechnol,
11, 146-151.
