Lect. 7 & 8

Lecture 7
Sequencing and assembling genomes

WHAT GENOME SEQUENCES CAN TELL US
1) All hereditary information is encoded in the sequence of bases in DNA.
2) Genomes vary from 600 Kb to > 3,000 Mb, but all can be sequenced.
3) Genes are units of transcription. Almost all code for proteins.
4) Since there is a nearly universal 3-letter translation code, the amino acid
sequence of a protein can be determined from the nucleotide sequence
of the gene. Comparison to other proteins gives useful hints to function.
This is where you need to understand protein evolution.
5) Knowing which proteins are encoded in the genome of an organism
helps us understand what it can and cannot do.
6) But it does NOT tell us what it does.
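Point 4 can be made concrete with a short sketch. It assumes the standard genetic code and a sequence that is already in frame; the function name `translate` and the example sequence are illustrative only.

```python
# Standard genetic code, with the bases of each codon enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame coding sequence codon by codon; '*' marks a stop."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGACAGAACCTGTTGCTTAA"))  # MTEPVA*
```

The translation itself is mechanical; working out what the protein does still requires comparison to known proteins and experiment.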
DNA is just a string of 4 bases - A,T,G,C - but a very long string!
The human genome has about 3,000 Mb carried on
22 chromosomes plus an X and a Y.
It has been completely sequenced and annotated.
How is it done?
[Photos: Bob Waterston, John Sulston, Craig Venter]
Genomic sequencing is an industrial, high-throughput process
(not to be carried out in an academic laboratory - Craig Venter)
Shot-gun sequencing is the way to go.
Sequence a lot of short fragments and assemble them on the basis of overlap.
DETERMINING A SEQUENCE
A low proportion of dideoxynucleotide triphosphates terminates the copies
made by DNA polymerase.
ddATP terminates a copy where an A is incorporated,
ddGTP terminates a copy where a G is incorporated,
etc.
The fragments are separated by length and the 4 bases are read on the basis
of the dye they incorporate.
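The logic of reading a sequence from terminated fragments can be sketched in a few lines. This is a toy model, not a simulation of the chemistry: it assumes every possible termination product is present, and the strand `GATTACA` and the function names are made up for illustration.

```python
def termination_lengths(copy, base):
    """Lengths of the copies that end where a dideoxy version of `base` went in."""
    return [i + 1 for i, b in enumerate(copy) if b == base]


def read_ladder(lengths_by_base):
    """Sort all fragments by length and read the terminating base of each in turn."""
    ladder = sorted(
        (length, base)
        for base, lengths in lengths_by_base.items()
        for length in lengths
    )
    return "".join(base for _, base in ladder)


synthesized = "GATTACA"  # hypothetical newly synthesized strand
fragments = {base: termination_lengths(synthesized, base) for base in "ACGT"}
print(fragments)               # {'A': [2, 5, 7], 'C': [6], 'G': [1], 'T': [3, 4]}
print(read_ladder(fragments))  # GATTACA
```

Sorting by length plays the role of electrophoresis, and the dictionary key plays the role of the dye colour.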
Gel electrophoresis can separate a fragment of 500 bases from one of 501 bases,
but can seldom separate a fragment of 800 bases from one of 801.
So “reads” are usually only about 500 bp long.
It takes a thousand reads to sequence a fragment of 100 Kb because you need
at least 5X redundancy (5 x 100 Kb / 500 bp = 1,000 reads).
It takes a million or more reads to cover a 100 Mb genome.
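The arithmetic behind these read counts is simple enough to check directly. The sketch below assumes the 500 bp read length and 5X redundancy quoted above; the function name is illustrative.

```python
import math


def reads_needed(target_bp, read_length_bp=500, redundancy=5):
    """Number of reads whose combined length is `redundancy` times the target."""
    return math.ceil(redundancy * target_bp / read_length_bp)


print(reads_needed(100_000))                    # 1000 reads for a 100 Kb fragment
print(reads_needed(100_000_000))                # 1000000 reads for a 100 Mb genome
print(reads_needed(100_000_000, redundancy=7))  # 1400000 reads at 7X
```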
Redundancy improves accuracy and generates overlaps
Assembly is based on sequence identity in regions of overlap (PHRAP)
Paired reads and BAC end sequencing establish overlaps and gaps
reads (500 bp)
contigs (5 Kb)
metacontigs (50 Kb)
BACs (200 Kb)
markers
chromosomes
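The first step of this hierarchy, merging reads into contigs on the basis of overlap, can be illustrated with a greedy sketch. It assumes error-free reads, a single strand and exact matches, so it is far simpler than what a real assembler such as PHRAP does; the reads and function names are invented for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences that share the longest overlap."""
    contigs = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:  # nothing left to merge
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


reads = ["ATGACAGAAC", "AGAACCTGTT", "CTGTTGCTGC"]  # hypothetical 10-base reads
print(greedy_assemble(reads))  # ['ATGACAGAACCTGTTGCTGC']
```

Repeats longer than the overlap would mislead this greedy rule, which is one reason paired reads and mapping data are needed at the higher levels.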
Shear DNA into fragments of ~2 Kb.
Ligate into a plasmid.
Transform E. coli with plasmid.
Pick thousands of individual clones
ROBOTICALLY.
Store in 96-well plates.
Since random inserts are sequenced, statistically 7 fold redundancy
is needed to cover >99.9% of the DNA.
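The 7 fold figure follows from a simple statistical model. Assuming reads land uniformly at random (a Poisson approximation, which ignores cloning bias and read-end effects), the fraction of bases hit by no read at redundancy c is about e^-c. The function name below is illustrative.

```python
import math


def fraction_covered(redundancy):
    """Expected fraction of bases covered by at least one read (Poisson model)."""
    return 1.0 - math.exp(-redundancy)


for c in (1, 3, 5, 7):
    print(f"{c}X redundancy covers {100 * fraction_covered(c):.2f}% of the DNA")
```

Under this model 5X leaves about 0.7% of the bases unsequenced, while 7X leaves less than 0.1%.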
Use cost-effective universal primers that recognize flanking plasmid
sequences to generate dye-marked fragments terminated by
incorporation of a dideoxynucleotide.
Separate fragments electrophoretically in an automatic sequencer.
The manufacturer’s computer program then calls the bases.
A program such as PHRED can establish confidence levels.
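PHRED-style confidence levels are conventionally reported as a quality value Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below only converts between the two; how P is estimated from the trace is not shown, and the function names are illustrative.

```python
import math


def phred_quality(error_probability):
    """Quality value for a given probability that the base call is wrong."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Probability of a wrong base call for a given quality value."""
    return 10 ** (-quality / 10)


print(round(phred_quality(0.001)))  # 30: one expected error per 1,000 calls
print(error_probability(20))        # 0.01
```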
ASSEMBLY AND FINISHING
Paired reads can establish gap size if the average insert length
of cloned fragments is known.
Clones can then be sequenced to fill the gaps.
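The gap estimate from paired reads is a subtraction. In the sketch below each linking clone has one read in the contig to the left of the gap and its mate in the contig to the right; `left` is the distance from the start of the clone to the end of the left contig, and `right` is the distance from the start of the right contig to the end of the clone. These names, the 2,000 bp insert length and the example numbers are assumptions for illustration.

```python
from statistics import mean


def estimate_gap(mean_insert_bp, linking_pairs):
    """Average gap implied by clones that span the gap.

    Each pair is (left, right): the part of the insert lying in the left
    contig and the part lying in the right contig, in bp.
    """
    return mean(mean_insert_bp - left - right for left, right in linking_pairs)


pairs = [(700, 900), (1200, 500), (300, 1200)]  # hypothetical linking clones
print(estimate_gap(2000, pairs))  # 400
```

Because individual inserts vary in length, several linking clones are averaged rather than trusting a single pair.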
PCR fragments up to several Kb covering the gaps can be generated and sequenced.
Errors in assembly can be recognized when PCR fails to generate
fragments of the expected size.
End sequencing of large insert clones carried in BACs or YACs
can generate metacontigs.
Shot-gun sequencing of a 200 Kb insert that covers a gap can fill the gap.
This shot-gun approach works well for segments up to 50 Kb
but is more problematic for large insert clones with >200 Kb
because incorrect assembly can result from low-information
regions and repetitive elements.
There are various programs that attack the assembly problem,
such as PHRAP and EULER. They all benefit from having
>7 fold depth of coverage to reduce errors in the
finished sequence. Error-free sequences can often be
uniquely assembled but can benefit from independent
mapping data.
At the chromosome level, assembly is a mapping problem.

Sequencing of model organisms with small genomes led the way.
Organism        Genome Size    Number of Genes
bacteria        1 to 5 Mb      1,000 to 3,000
yeast           12 Mb          6,000
Dictyostelium   34 Mb          12,000
C. elegans      100 Mb         18,000
Drosophila      120 Mb         14,000
human           3,000 Mb       25,000
Physical (sequence based) maps were generated in different
ways in each of these organisms.
Large insert YACs were screened for physically
mapped markers and a tiling set was chosen that covers
each chromosome.
[Figure: YAC tiling map of C6. Markers, in order: DIRS, dhkA, vatM, manA, myoM, gluA, rasD, pabA, vsgB. Overlapping numbered YAC clones are drawn beneath the markers.]
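Choosing a tiling set from mapped clones can be treated as an interval-covering problem. The sketch below uses the standard greedy rule (from the current position, take the clone that reaches furthest); the clone names and coordinates are hypothetical and are not taken from the map above, and this is not necessarily how the tiling set was chosen in practice.

```python
def tiling_path(clones, start, end):
    """Greedy minimal set of clones covering [start, end).

    `clones` maps a clone name to its (left, right) coordinates on the map.
    """
    path, position = [], start
    while position < end:
        candidates = [
            (right, name)
            for name, (left, right) in clones.items()
            if left <= position < right
        ]
        if not candidates:
            raise ValueError(f"no clone covers position {position}")
        right, name = max(candidates)  # the clone extending furthest to the right
        path.append(name)
        position = right
    return path


clones = {  # hypothetical YACs, coordinates in Kb
    "Y1": (0, 250), "Y2": (100, 400), "Y3": (200, 380),
    "Y4": (350, 700), "Y5": (600, 900),
}
print(tiling_path(clones, 0, 900))  # ['Y1', 'Y2', 'Y4', 'Y5']
```

A position that no clone covers raises an error, which corresponds to a gap in the physical map.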
Overlapping YACs formed scaffolds for assembly of
sequence-based contigs and confirmed the legacy map.
Position and order were verified by HAPPY mapping.
Summary of genome sequencing methodology
1) Sequence 500 bp from each end of fragments cloned
in a plasmid using primers that start within the plasmid
sequence.
2) Assemble contigs on the basis of sequence overlap.
3) Use paired reads to recognize gaps.
4) End sequence large inserts (~200 Kb) carried in BACs
or YACs to generate a scaffold.
5) Position scaffolds on chromosomes using physically
mapped markers. Fill the gaps.
6) Declare the sequence done!

GENES
The sequence of DNA (the genotype) is replicated at each cell division,
but it is the phenotype that matters.
Genes make proteins and the phenotype is determined by the proteins
that accumulate in different cell types.
But only 2% of human DNA encodes proteins. Genes are hard to find.
The Proteome is the complete repertoire of proteins that a species
can make.
Genes can be recognized in the DNA sequence on the basis of coding
potential.
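The simplest measure of coding potential is an open reading frame: an ATG followed in the same frame by a run of codons before the first stop. The sketch below scans the three forward frames only and ignores introns, so it is a minimal illustration rather than a gene finder; the function name, the `min_codons` threshold and the example sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Forward-strand ORFs as (start, end) pairs, 0-based with end exclusive.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


sequence = "CCATGACAGAACCTTAAGG"  # hypothetical sequence
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])  # 2 17 ATGACAGAACCTTAA
```

In genomes with introns an ORF is interrupted at splice sites, which is why trained models and mRNA evidence are used as well.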
Genes can also be recognized from their transcribed mRNA sequences.
A higher percentage of the DNA encodes proteins in organisms
with smaller genomes.
HMMgene Prediction of ORFs
>JC1b01d12.r1 contig

[start of ORF (ATG)]
AAAAAAAAAA AAACGATATT TGTTAAATTT CAACTTTCAA ATAATGACAG AACCTGTTGC
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxUUUUUU UUUCCCCCCC CCCCCCCCCC

TGCACCAAAA AAAAAGATTG TTTTAAAGAG AGCAGCTGGT AGTAGTTCAT CCAATGAATT
CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC CCCCCCCCCC

[splice site]
TAAAATTGAA TCAATTGATA AAACTTTTGG TAATTATTAT TATTATTATT TAAATTTATT
CCCCCCCCCC CCCCCCCCCC CCCCCCCCC1 1111111111 1111111111 1111111111

[splice site]
ATTTAAAAGA AAAAATAAAA ATGTTTAACT TTTTTTTTTT TTTTTTTTTA GAATTACCAA
1111111111 1111111111 1111111111 1111111111 1111111111 1CCCCCCCCC

ATCATTTAAA AAAAGTAAAT GAAAATTTTA ATAATAAATC AAGTACAATT TATAATGTAT

ATGAAAAACA AGCAACTGAT ATATTTACAA ATTGGATAAA AGAAAAAAGA TATATCTTAG

[translation termination (end of ORF)]
ATGTTTGGTC TTAAGAATAA AAAAATAAAA AATACAAATA TGAATAATAA AATAAAAATG
CCCCCCCCCC CCCCuuuuuu uuuuuuuuuu uuuuuuuuuu uuuuuuuuuu uuuuuuxxxx

GCTTTATTTA ATTATTTTAA ATTTAATTTT CCCATTTGTT TTTGTAATTT CTTTTCTTCC
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx

TTTTGGGCCG TTTTTTAATT TTTTTTTTTT TTTGTGATTT TTAATTTAAA AAAAAAAAAA

AAAAAAATAA ATAAATAAAA AAAAAAGAAT GTTTAGAATA ACAAAATTTA ATAAATATTA

TAATAAATTT AGGTCATTTA AAAGAAAAAA TATAATTTCC ATA
xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxx
numgf 19 >JC1b01d12.r1 pORF 272 -- 1295 strand f start y stop y
MSNKVGNSKNNKNKSIKFAPKHKDKSYDNEDFNAVSKKSSISVSDLPTKGEEKHRIMALS
FPIKLSMWDFGQCDSKKCTGRKLERLGYVKSINLTHKFKGIVLTPSAKQSISPADRDIVQ
NLGVSVVDCSWAKVDSIPFGKMKGGHDRLLPFLIAANPVNYGKPFKLTCVEAVAACLFIT
GFTAEGHQVLGGFKWGSSFYKVNKDLFEKYILCANSQEVVQIQNEFIAKCEQDQKDRAAN
IEYDEFGLQLNPNRILRTNNDDDEENGDEDYCDDDDEDEDEEDEEEDHECDSECDHDEEE
EEDNDE
[Figure: Classification of chromosome 2 encoded proteins using GO terminology. Panels: Process, Function.]
Summary of methodology for recognizing genes
Experimental
Sequence cDNAs from a large number of mRNAs and compare to genomic sequence.
Computational
Train an HMM program to recognize start sites, exons, splice sites,
introns, and termination sites of ORFs. Predict genes.
Compare predicted proteins to legacy protein sequences.
Assign likely function to proteins.
Experimental
In situ hybridization to determine cell type expression.
Molecular genetics (knockouts, etc.) to determine function.
Lecture 8
Exon shuffling and Gene Loss
New proteins can arise by incorporating domains from
other proteins. This process is aided by exon shuffling,
but exons do not define domains.
The genetic repertoire carried by the common ancestor
of plants, animals and fungi may have been larger than
what is found in any of these kingdoms now.
Specialized organisms shed genes.
When you have a whole genome sequenced, missing genes
can be recognized.
Common domains
Many large (> 200 aa) proteins have multiple, partially independent
domains. Some of these domains are found in various different proteins.
When organisms evolved a closed circulatory system about 400 Myrs ago,
there was a strong selection for clotting proteins to fill any accidental leaks.
Factor XII, a protein of 600 amino acids, is one of the clotting factors.
It "borrowed" several previously established domains.
Domain order: EGF - fibronectin I - EGF - kringle - kallikrein
(The EGF domain is found in many extracellular proteins and receptors.
It is often involved in protein-protein interactions.)
(Found in several clotting factors as well as proteases.)
(Another clotting factor domain, also found in peptidases.)
Exon shuffling
Exons in the 9 kb Factor XII gene
Before: exon 4 - exon 6 - exon 7
Inserted: x exon 5 x
After: exon 4 - exon 5 - exon 6 - exon 7
Since introns are spliced out, length is not important.

The 5' exon/intron border [x/GT...] may:
fall at the end of a codon (phase 0),
follow the first base of a codon (phase 1),
or follow the first 2 bases of a codon (phase 2).
Introns must begin and end in the same phase class [AG/yz], so that the
codon interrupted at the 5' border is completed after the 3' border.
Therefore, an inserted exon must have the same phase group
as the flanking exons. Many inserted exons are class 1.
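The phase rule can be checked with modular arithmetic. The sketch below assumes exon lengths are counted in coding bases from the first base of the start codon; the exon lengths and function names are hypothetical.

```python
def intron_phases(exon_lengths):
    """Phase of the intron after each exon: coding bases so far, modulo 3."""
    phases, total = [], 0
    for length in exon_lengths[:-1]:
        total += length
        phases.append(total % 3)
    return phases


def can_insert(exon_length, exon_phase, intron_phase):
    """True if an exon can be inserted into an intron without shifting the frame.

    The exon must be symmetric (same phase at both ends, so its length is a
    multiple of 3) and its phase must match that of the recipient intron.
    """
    return exon_length % 3 == 0 and exon_phase == intron_phase


print(intron_phases([100, 120, 50]))  # [1, 1]: the middle exon is class 1-1
print(intron_phases([100, 50]))       # [1]: removing it keeps the frame
print(can_insert(120, 1, 1))          # True
print(can_insert(121, 1, 1))          # False: downstream codons would shift
print(can_insert(120, 1, 0))          # False: phase does not match the intron
```

A class 1 exon starts with the last 2 bases of a codon and ends with the first base of another, so its length is always a multiple of 3.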
Protein modules
Exon shuffling can not only add a new exon, but can also
duplicate existing exons or delete an exon.
The boundary amino acids encoded by an exon often do
not coincide with the boundaries of domains (contrary to
what some have proposed).
However, if a domain is encoded within an exon, it can
become a mobile module.
Gene loss
Genes can either arise in a specific lineage or be lost in a related lineage.

Evidence for gene loss requires that orthologs be present in "flanking" species
derived from a common ancestor.
1002 genes present in tomato were not found in Arabidopsis. Of these, 154 were
clearly present in either soy or Medicago. These are cases of gene loss in
the Arabidopsis lineage.
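The inference can be written as set operations on lists of orthologous genes. The sketch below assumes ortholog assignments have already been made, which is the hard part in practice; the gene identifiers and the function name are hypothetical.

```python
def candidate_losses(reference, target, flanking):
    """Genes in `reference` that are absent from `target`.

    Returns (missing, supported): all genes missing from the target, and the
    subset also found in at least one flanking genome, which points to loss
    in the target lineage rather than gain in the reference lineage.
    """
    missing = set(reference) - set(target)
    supported = {g for g in missing if any(g in genome for genome in flanking)}
    return missing, supported


tomato = {"g1", "g2", "g3", "g4"}  # hypothetical ortholog identifiers
arabidopsis = {"g1"}
soy = {"g1", "g2"}
medicago = {"g3"}

missing, lost = candidate_losses(tomato, arabidopsis, [soy, medicago])
print(sorted(missing))  # ['g2', 'g3', 'g4']
print(sorted(lost))     # ['g2', 'g3']
```

In this toy example g4 stays unresolved: it could have arisen in the tomato lineage or been lost more than once.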
Some highly conserved genes that are present in both monocots and
dicots have been lost in Arabidopsis.
One of them, slr2032, appears to have come from the Synechocystis-like
genome that gave rise to the chloroplast.
History of gene slr2032
[Figure: lineage from a blue-green bacterium through a "primitive" alga, a diatom and algae to monocots and Arabidopsis, plotted against the number of genes in the chloroplast genome.]
algae: slr2032 found in chloroplast genome
monocots: slr2032 found in nuclear genome
Arabidopsis: slr2032 missing from both chloroplast and nuclear genomes
The complement of ancient genes available to the common ancestor of the
crown organisms included genes with orthologs now in early diverging organisms
as well as either plants, fungi or animals. Likewise, genes with orthologs in both
a plant and a fungus or an animal were available to the common ancestor.
2,258 such genes have been recognized.
[Figure: phylogenetic tree of Saccharomyces, Neurospora, Schizosaccharomyces, Homo, Ciona, Fugu, Drosophila, Anopheles, Caenorhabditis, Dictyostelium, Oryza, Arabidopsis, Plasmodium, Leishmania and bacteria. Scale bar: 100 Darwins.]

Percent of ancient genes that have been retained
[Figure: two panels, Dicty / Arab. / Sacch. and Dicty / Arab. / Dros.]
Percent of ancient genes that have been lost or
highly modified since the plant/animal divergence
[Figure: two panels, Dicty / Arab. / Sacch. and Dicty / Arab. / Dros.]
Comparison of members of large families of related genes
in diverse organisms can uncover a history of gene loss
and domain loss.
The ABC family is one of the largest in eukaryotic genomes.
ABC genes encode half-transporters with one ABC domain and
full-transporters with two ABC domains. Many have
transmembrane domains.
Members of the ABC superfamily of transporters all have related
ATP-binding cassettes. There are 8 families.
There are 68 ABC genes in the Dictyostelium genome.
Half-transporters
Full transporters
There are 11 ABCA genes in Dictyostelium.
Fungi have no genes of this family.
In animals, ABCA proteins all have two transmembrane domains
(humans have 12 such genes).
In plants there is one gene with two domains and 16 with a single domain.
There appear to have been several cases of gene loss affecting whole
kingdoms.
The ABCG family is the only one in which the ABC cassette
precedes the transmembrane domain. The progenitor may have
arisen by fusion of domains or domain loss.
Summary
Exon shuffling can facilitate insertion, deletion or duplication of a protein
domain.
Genes duplicate and diverge, but sometimes both copies
are lost because they are not needed in a new context (new species).
When a whole genome sequence is available, the LACK of a given
gene can be definitive.
Some highly conserved genes, such as slr2032, are missing in Arabidopsis.
What has taken over their functions?
Analyses of proteins that are members of superfamilies can uncover histories
of gene loss. In the ABCA family the last copy of a half-transporter was lost
between the time that Dictyostelium diverged and the time that fungi diverged.
The last copy of a full-transporter was lost in the line leading to fungi.
There are several ways by which domain order can be rearranged.
