1
00:00:00.000 --> 00:00:03.152
This lecture is about why we study
genomics and what it can teach us.
2
00:00:03.152 --> 00:00:08.150
So genomics is the study of
the genomes inside of us.
3
00:00:08.150 --> 00:00:10.600
Let's talk about human genomics.
4
00:00:10.600 --> 00:00:14.490
Everybody on the planet has a genome
that has governed their development and
5
00:00:14.490 --> 00:00:16.770
governs a lot of their biology, and
6
00:00:16.770 --> 00:00:21.030
as you can see by looking at any crowd
of people, we all look really different.
7
00:00:21.030 --> 00:00:24.620
However we've discovered through
sequencing in recent years
8
00:00:24.620 --> 00:00:28.590
that we're actually 99.9% identical or
even more than that.
9
00:00:28.590 --> 00:00:33.005
So it's really remarkable how
much diversity you can create
10
00:00:33.005 --> 00:00:38.020
from a very small number
of changes in your genome.
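To put a rough number on that "very small number of changes", here is a minimal sketch. It assumes a genome of about three billion base pairs (the size quoted later in this lecture) and takes the 99.9% figure at face value; the variable names are illustrative only.

```python
# Rough estimate of how many positions differ between two people.
# Assumptions: ~3 billion base pairs per genome, 99.9% identity.
GENOME_SIZE_BP = 3_000_000_000
DIFFERING_PER_THOUSAND = 1  # 0.1% written as 1 position in every 1,000

differing_positions = GENOME_SIZE_BP * DIFFERING_PER_THOUSAND // 1000
print(f"{differing_positions:,} differing positions")
```

Under those assumptions the "small" 0.1% still corresponds to a few million positions.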
11
00:00:38.020 --> 00:00:42.484
But of course, now that we know that we're
99.9% identical, we still want to know,
12
00:00:42.484 --> 00:00:45.055
what is it that's driving
all these differences?
13
00:00:45.055 --> 00:00:47.110
Why is one person tall and
another person short?
14
00:00:47.110 --> 00:00:52.230
Why does one person live to be 100
and another person does not?
15
00:00:52.230 --> 00:00:55.109
Why does one person get cancer and
another person not?
16
00:00:55.109 --> 00:00:59.084
Many of these things we suspect
are driven by our genomes, and
17
00:00:59.084 --> 00:01:00.925
we want to understand that.
18
00:01:00.925 --> 00:01:05.853
So, another thing, one of the most
basic things that our genome determines
19
00:01:05.853 --> 00:01:08.060
is how our bodies develop.
20
00:01:08.060 --> 00:01:11.709
We start off, as you all know, we start
off as a single cell which divides into
21
00:01:11.709 --> 00:01:15.362
a few apparently identical cells, but
that quickly develops into an embryo,
22
00:01:15.362 --> 00:01:17.408
and eventually grows into a whole person.
23
00:01:17.408 --> 00:01:20.904
And somehow that entire program of
development is encoded in our genome, and
24
00:01:20.904 --> 00:01:23.178
this is something that we don't yet
understand.
25
00:01:23.178 --> 00:01:24.015
In addition,
26
00:01:24.015 --> 00:01:28.620
the code in our cells determines all
the different cell types and for example,
27
00:01:28.620 --> 00:01:33.364
it determines how to make a neuron, which
is a very complicated cell, obviously
28
00:01:33.364 --> 00:01:38.210
a very different kind of cell from say a
skin cell, it does very different things.
29
00:01:38.210 --> 00:01:41.823
And yet the genome inside of a neuron
in your body is identical to the genome
30
00:01:41.823 --> 00:01:43.468
inside of any of your skin cells.
31
00:01:43.468 --> 00:01:47.296
So we want to understand what's going on
in that cell even though it has the same
32
00:01:47.296 --> 00:01:51.240
program, the same code, somehow it's
executing a different program to make it
33
00:01:51.240 --> 00:01:52.877
into a neuron versus a skin cell.
34
00:01:52.877 --> 00:01:56.750
Another big area of research
in genomics is cancer.
35
00:01:56.750 --> 00:02:01.510
So cancer is essentially
a genetic disease, we know now.
36
00:02:01.510 --> 00:02:05.220
Cancer cells are simply, again, cells in
your body that have the same genetic code,
37
00:02:05.220 --> 00:02:08.140
the same genome in them, but
somehow they've gone haywire,
38
00:02:08.140 --> 00:02:10.840
and they've started
replicating without control.
39
00:02:10.840 --> 00:02:12.560
That's what makes something cancerous.
40
00:02:12.560 --> 00:02:15.970
Basically it's cells that are dividing
without any check on their division.
41
00:02:15.970 --> 00:02:19.640
And, in fact, we define cancers by
the type of cell that started the cancer.
42
00:02:19.640 --> 00:02:23.700
So there's skin cancer, where a skin
cell starts dividing without control.
43
00:02:23.700 --> 00:02:25.370
One form of it is called melanoma.
44
00:02:25.370 --> 00:02:26.500
There's lung cancer.
45
00:02:26.500 --> 00:02:28.500
There's blood cancers
that are called leukemia.
46
00:02:28.500 --> 00:02:32.459
These are all defined by the cells
that started the cancer out and
47
00:02:32.459 --> 00:02:35.169
they all have a common phenotype, that is,
48
00:02:35.169 --> 00:02:39.658
they all have a common feature that
they're dividing without control.
49
00:02:39.658 --> 00:02:44.604
But the consequences of different cancers
are very different, and in fact,
50
00:02:44.604 --> 00:02:49.475
the mutations in our DNA that cause
these cells to become cancerous are also
51
00:02:49.475 --> 00:02:50.340
different.
52
00:02:52.530 --> 00:02:54.982
So what do our genes have
to do with any of this?
53
00:02:54.982 --> 00:02:56.040
So what I'm talking about,
54
00:02:56.040 --> 00:03:00.670
I just mentioned the word mutation,
a mutation is a change in your genome.
55
00:03:00.670 --> 00:03:04.780
And that can happen because
your DNA is damaged,
56
00:03:04.780 --> 00:03:07.060
it can happen because of
an accident in replication.
57
00:03:07.060 --> 00:03:10.055
So every time your cells divide,
to explain that latter point,
58
00:03:10.055 --> 00:03:13.670
every time your cells divide,
the entire genome has to be copied.
59
00:03:13.670 --> 00:03:16.190
And our cells are really,
really good at this, fortunately,
60
00:03:16.190 --> 00:03:17.720
otherwise we wouldn't exist.
61
00:03:17.720 --> 00:03:21.740
We wouldn't survive for very long, but
once in a while, they make an error,
62
00:03:21.740 --> 00:03:25.985
probably only one to three
errors per cell division.
63
00:03:25.985 --> 00:03:29.520
And once in a while, that error
causes something bad to happen, and
64
00:03:29.520 --> 00:03:33.300
we believe a lot of cancers are caused
by these sort of accidental errors.
65
00:03:33.300 --> 00:03:37.990
And understanding that is a matter
of understanding, well okay,
66
00:03:37.990 --> 00:03:42.390
my cell makes an error,
what does it mean for a mutation or
67
00:03:42.390 --> 00:03:46.160
an error in replication
to turn a cell cancerous?
68
00:03:46.160 --> 00:03:50.330
What usually we think happens is
that that mutation affects a gene
69
00:03:50.330 --> 00:03:53.500
which now doesn't function properly and
that gene, for example,
70
00:03:53.500 --> 00:03:56.950
that might be a gene that
controls cell division, and
71
00:03:56.950 --> 00:03:59.610
now you've sort of turned off
the check on cell division.
72
00:03:59.610 --> 00:04:02.050
And now the cell starts replicating
without control and you have a cancer.
73
00:04:02.050 --> 00:04:04.720
So that's the kind of thing we're
looking at when we're using genomics to
74
00:04:04.720 --> 00:04:05.570
study cancer.
75
00:04:06.690 --> 00:04:07.950
So how does this all work?
76
00:04:07.950 --> 00:04:10.790
So this program that I'm talking
about that's encoded in our DNA.
77
00:04:10.790 --> 00:04:14.420
Well there's something
called the central dogma.
78
00:04:14.420 --> 00:04:18.660
I didn't make that word up, that phrase
was created by Francis Crick and
79
00:04:18.660 --> 00:04:21.990
one of the co-discoverers of the structure
of DNA over fifty years ago.
80
00:04:23.070 --> 00:04:26.500
And it's now still used,
even though, as with many dogmas,
81
00:04:26.500 --> 00:04:27.770
it's not an absolute dogma.
82
00:04:27.770 --> 00:04:31.760
But the central dogma of biology,
or molecular biology,
83
00:04:31.760 --> 00:04:36.485
says that information flows in
a single direction from your genome,
84
00:04:36.485 --> 00:04:39.262
that is your DNA, to RNA, to proteins.
85
00:04:39.262 --> 00:04:43.789
And the processes that govern
that we give different names.
86
00:04:43.789 --> 00:04:48.294
So the copying, when DNA is turned into
genes, the first step is you take pieces
87
00:04:48.294 --> 00:04:52.665
of it called exons, and you transcribe
them, that's the copying process,
88
00:04:52.665 --> 00:04:57.036
into RNA, and RNA is essentially an exact
copy of the DNA where all the letters
89
00:04:57.036 --> 00:05:00.399
are the same with the only
difference being the letter t, or
90
00:05:00.399 --> 00:05:03.350
thymine, becomes a letter u,
which is uracil.
91
00:05:03.350 --> 00:05:05.980
But otherwise it's
molecularly the same thing.
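The letter-for-letter copying described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the input is treated as the coding strand of the DNA, and the function name `transcribe` is a hypothetical helper, not part of any library.

```python
def transcribe(dna: str) -> str:
    """Return the RNA spelling of a DNA coding-strand sequence (T becomes U)."""
    return dna.upper().replace("T", "U")

# Toy example sequence, made up for illustration.
rna = transcribe("ATGGCCTAA")
print(rna)
```

Every letter stays in place; only the T's are rewritten as U's.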
92
00:05:05.980 --> 00:05:08.110
That RNA then has to be
turned into a protein.
93
00:05:08.110 --> 00:05:11.970
Now, proteins are not comprised of
these four letters of nucleic acids.
94
00:05:11.970 --> 00:05:16.380
They're made up of 20 letters, which
are the abbreviations for
95
00:05:16.380 --> 00:05:21.050
amino acids and proteins are also long
molecules, not nearly as long as DNA.
96
00:05:21.050 --> 00:05:23.960
A typical protein might be 300 or
400 amino acids long,
97
00:05:23.960 --> 00:05:26.940
and the way you get a protein
is you take a piece of RNA and
98
00:05:26.940 --> 00:05:31.980
you read it three letters at a time,
and each triplet encodes an amino acid.
99
00:05:31.980 --> 00:05:36.880
And if you think about it for a second
there's four possible RNA nucleotides.
100
00:05:36.880 --> 00:05:41.070
So there's four to the third,
or 64 possible combinations.
101
00:05:41.070 --> 00:05:45.480
Each of those 64 triplets gets
102
00:05:45.480 --> 00:05:50.200
translated either into an amino acid or not.
103
00:05:50.200 --> 00:05:52.220
There's three special
ones called stop codons.
104
00:05:52.220 --> 00:05:53.500
They indicate the end of a protein.
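The triplet reading described above can be sketched as follows. The table is the standard genetic code written with one-letter amino acid abbreviations, with "*" marking a stop codon; the names `CODON_TABLE` and `translate` and the toy RNA sequence are illustrative choices, not something from the lecture.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed with U, C, A, G order at each codon position.
# "*" marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# 4 * 4 * 4 = 64 possible triplets, each mapped to an amino acid or a stop.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

def translate(rna: str) -> str:
    """Read RNA three letters at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":  # stop codon: the protein ends here
            break
        protein.append(aa)
    return "".join(protein)

print(len(CODON_TABLE), STOP_CODONS)
print(translate("AUGGCCUAA"))
```

The sketch starts reading at the first letter; finding the right starting point in a real RNA is a separate problem that it does not attempt.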
105
00:05:53.500 --> 00:05:58.140
So that's basically how DNA goes and
becomes a protein.
106
00:05:58.140 --> 00:06:00.540
And the proteins kind of do
all the work of your cells.
107
00:06:00.540 --> 00:06:04.650
So the proteins in your body
are what are actually doing most
108
00:06:04.650 --> 00:06:06.900
of the functional work of say,
metabolizing things,
109
00:06:06.900 --> 00:06:10.160
digesting your food,
moving things around in the cells.
110
00:06:10.160 --> 00:06:15.270
So that fundamental dogma has been around
for many decades now, and it more or less
111
00:06:15.270 --> 00:06:20.960
describes how information flows most of
the time from your genome to proteins.
112
00:06:20.960 --> 00:06:22.940
However, that's not the whole picture,
we now know.
113
00:06:22.940 --> 00:06:27.690
So over time, we've learned that
information can flow the other way,
114
00:06:27.690 --> 00:06:31.280
and as scientists got more
familiar with the whole model,
115
00:06:31.280 --> 00:06:33.490
they realized that it had
to flow the other way.
116
00:06:33.490 --> 00:06:36.530
As I was saying a little earlier in this
lecture, there are many different cell
117
00:06:36.530 --> 00:06:39.890
types in your body,
every cell has the same exact DNA.
118
00:06:39.890 --> 00:06:42.870
So if everything just flowed
from the DNA to the proteins,
119
00:06:42.870 --> 00:06:46.500
it would seem sort of fundamentally
impossible for the cells to
120
00:06:46.500 --> 00:06:50.120
behave differently, yet we know that
neurons don't act like skin cells.
121
00:06:50.120 --> 00:06:50.825
So what's going on?
122
00:06:50.825 --> 00:06:54.800
So the proteins themselves, some of
the proteins that are created by the DNA
123
00:06:54.800 --> 00:06:57.880
go back and bind to that DNA and
modify it and
124
00:06:57.880 --> 00:07:00.250
change the genes that get turned on and
off.
125
00:07:00.250 --> 00:07:02.158
So proteins can self-regulate in this way.
126
00:07:02.158 --> 00:07:05.820
And there are other things that can
happen with DNA, other modifiers,
127
00:07:05.820 --> 00:07:09.470
some are called methylation marks
that can change DNA as well.
128
00:07:09.470 --> 00:07:13.214
So there are features on the DNA that
are affected by the proteins themselves.
129
00:07:13.214 --> 00:07:17.692
So there are feedback loops in the process,
in this sort of information flow, and
130
00:07:17.692 --> 00:07:21.620
as a result,
information's actually flowing backwards.
131
00:07:21.620 --> 00:07:22.861
So in the genomics field, so
132
00:07:22.861 --> 00:07:25.463
how do we make these measurements
that I'm talking about?
133
00:07:25.463 --> 00:07:29.277
If we want to
understand cancer, then we have to go and
understand cancer, then we have to go and
134
00:07:29.277 --> 00:07:33.233
get some cancer cells and figure out
what mutations happen in the cells.
135
00:07:33.233 --> 00:07:34.138
So how do we do that?
136
00:07:34.138 --> 00:07:35.800
We do that with sequencing.
137
00:07:35.800 --> 00:07:39.040
So sequencing is sort of at
the heart of genomics, and
138
00:07:39.040 --> 00:07:42.620
the genomics revolution that we've been
in for about the past 20 years, and
139
00:07:42.620 --> 00:07:46.150
this really accelerated
over the past ten years.
140
00:07:46.150 --> 00:07:50.220
And one reason for this acceleration
is that genome technology has gotten
141
00:07:50.220 --> 00:07:52.500
incredibly fast and efficient.
142
00:07:52.500 --> 00:07:55.420
So what you're looking at here are some
of the latest sequencing machines.
143
00:07:55.420 --> 00:07:58.800
A sequencer today, the highest-throughput
sequencer we have today, can sequence in
sequencer we have today can sequence in
144
00:07:58.800 --> 00:08:04.120
a single run of the machine,
as many as a trillion nucleotides of DNA.
145
00:08:04.120 --> 00:08:08.240
So to give you a sense of what that means,
the Human Genome Project was started
146
00:08:08.240 --> 00:08:13.170
in 1989 with the goal of sequencing
one human genome in 15 years.
147
00:08:13.170 --> 00:08:16.430
It beat that goal, we actually
published the human genome in 2001, so
148
00:08:16.430 --> 00:08:19.290
in just 12 years we finished the project.
149
00:08:19.290 --> 00:08:21.810
I was part of that project.
150
00:08:21.810 --> 00:08:24.780
And it was a massive effort
involving thousands of scientists
151
00:08:24.780 --> 00:08:25.990
from around the world.
152
00:08:25.990 --> 00:08:27.480
And sequencers were employed at
153
00:08:28.620 --> 00:08:31.730
half a dozen huge genome
sequencing centers in the US, and
154
00:08:31.730 --> 00:08:36.590
large sequencing centers in the UK,
in France, in China, all over the world.
155
00:08:36.590 --> 00:08:40.210
Today you can get
a sequencer in a single lab,
156
00:08:40.210 --> 00:08:44.890
one of these machines run by a single
investigator, and in just a few days,
157
00:08:44.890 --> 00:08:49.150
you can sequence on the order of several
hundred human genome equivalents.
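As a back-of-the-envelope check on "several hundred human genome equivalents", here is a minimal sketch. It uses the trillion-nucleotide run mentioned above and assumes a genome size of about three billion base pairs (the upper end of the range quoted later in this lecture). It counts raw bases only; in practice each genome is read many times over, so the number of finished genomes per run would be lower.

```python
# Raw sequencing output expressed in human genome equivalents.
BASES_PER_RUN = 10**12        # "as many as a trillion nucleotides" per run
GENOME_SIZE_BP = 3 * 10**9    # assumed: ~3 billion base pairs

genome_equivalents = BASES_PER_RUN / GENOME_SIZE_BP
print(f"about {genome_equivalents:.0f} genome equivalents per run")
```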
158
00:08:49.150 --> 00:08:53.360
So now we're in maybe a little more than
a dozen years after the completion of
159
00:08:53.360 --> 00:08:54.135
the human genome.
160
00:08:54.135 --> 00:08:57.140
12 year project involving
thousands of scientists.
161
00:08:57.140 --> 00:09:00.330
Now a single scientist in one day can
do far more sequencing than that entire
162
00:09:00.330 --> 00:09:01.630
consortium did.
163
00:09:01.630 --> 00:09:04.947
So that's allowed us to start looking
at things like cancer genomics.
164
00:09:04.947 --> 00:09:09.369
When the human genome was published in
2001, no one at that time thought it was
165
00:09:09.369 --> 00:09:13.922
even remotely feasible to start sequencing
the entire genome of a single tumor, and
166
00:09:13.922 --> 00:09:17.449
yet today, we have literally tens
of thousands of projects going
167
00:09:17.449 --> 00:09:19.659
on around the world doing exactly that.
168
00:09:19.659 --> 00:09:24.148
So the result of that is that we
are generating these enormous,
169
00:09:24.148 --> 00:09:26.400
enormous data sets.
170
00:09:26.400 --> 00:09:30.470
So sure we can sequence all that data,
but what I didn't say was that
171
00:09:30.470 --> 00:09:33.200
towards the end of the Human Genome
Project, when we were at the point where
172
00:09:33.200 --> 00:09:36.280
we were writing the paper, and I was part
of one of the teams that was doing that,
173
00:09:36.280 --> 00:09:39.850
we had hundreds of scientists frantically
trying to analyze all this data from
174
00:09:39.850 --> 00:09:43.620
a single genome and figure out what we
could say about it in a scientific paper.
175
00:09:43.620 --> 00:09:48.250
So today, one investigator, one lab,
can generate multiple genomes
176
00:09:48.250 --> 00:09:52.150
in a space of a week, but that doesn't
mean that in the space of a week, or
177
00:09:52.150 --> 00:09:55.000
a few days, you can analyze all that data,
not at all.
178
00:09:55.000 --> 00:09:59.390
So you need powerful computers running for
days or even weeks just to
179
00:09:59.390 --> 00:10:02.490
churn through the data and turn it into
something that a person can look at.
180
00:10:02.490 --> 00:10:04.610
And there are many different
questions you can ask about it.
181
00:10:04.610 --> 00:10:07.920
One question that I sort of already
alluded to is, you can ask well,
182
00:10:07.920 --> 00:10:11.220
what are the mutations in this cell
versus other cells from the same person?
183
00:10:11.220 --> 00:10:13.410
So that's say,
a kind of question you could ask.
184
00:10:13.410 --> 00:10:17.363
That requires significant amounts of
computing to take that bewildering mass of
185
00:10:17.363 --> 00:10:20.685
data and turn it into something
comprehensible to a group of scientists
186
00:10:20.685 --> 00:10:21.900
who can then analyze it.
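The question of which mutations separate one cell from another cell of the same person can be illustrated with a toy comparison. This is a deliberately simplified sketch: it assumes the two sequences are already aligned and the same length, and both sequences and the helper name `find_differences` are made up for illustration. Real analyses work from millions of short reads and need far more computation, which is the point made above.

```python
def find_differences(normal: str, tumor: str) -> list:
    """List (position, normal_base, tumor_base) wherever two aligned sequences differ."""
    if len(normal) != len(tumor):
        raise ValueError("sequences must be the same length (already aligned)")
    return [(pos, n, t)
            for pos, (n, t) in enumerate(zip(normal, tumor))
            if n != t]

# Toy sequences, invented for this example.
normal_seq = "ACGTACGTAC"
tumor_seq = "ACGTTCGTAA"
print(find_differences(normal_seq, tumor_seq))
```

Positions are counted from zero, so the toy output reports differences at positions 4 and 9.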
187
00:10:21.900 --> 00:10:25.210
So another thing that's driven this
revolution is not just the efficiency but
188
00:10:25.210 --> 00:10:25.980
the cost.
189
00:10:25.980 --> 00:10:28.140
So at the same time that things have gotten faster,
190
00:10:28.140 --> 00:10:31.325
and more efficient that way,
they've also gotten much cheaper.
they've also got much cheaper.
191
00:10:31.325 --> 00:10:36.171
So this plot that you're looking at now
shows you the rough cost per human genome
192
00:10:36.171 --> 00:10:40.673
equivalent going back to around the time
the human genome was completed.
193
00:10:40.673 --> 00:10:44.569
So when the human genome was finished
in 2001, the scientific community then
194
00:10:44.569 --> 00:10:48.463
proceeded with several other important
mammalian genomes that are about the same
195
00:10:48.463 --> 00:10:52.357
size, such as the mouse genome, and the
cow genome, and these are genomes that,
196
00:10:52.357 --> 00:10:55.771
like human, are around two and
a half to three billion base pairs long.
197
00:10:55.771 --> 00:11:00.420
And those projects cost on the order
of $25 or $30 million to sequence.
198
00:11:00.420 --> 00:11:05.210
So that cost started to drop, from that
point on dropped very rapidly, and
199
00:11:05.210 --> 00:11:08.775
then around 2007, there was the introduction
of a new technology from a company called
200
00:11:08.775 --> 00:11:14.120
Solexa, now called Illumina,
that led to even more rapid drops in cost,
201
00:11:14.120 --> 00:11:18.340
because the sequencing technology
itself changed really dramatically and
202
00:11:18.340 --> 00:11:22.140
we'll talk about that a little
bit later in this course.
203
00:11:22.140 --> 00:11:25.020
But as a result,
the sequencing cost today for
204
00:11:25.020 --> 00:11:27.220
a human genome is on the order of $1,000.
205
00:11:27.220 --> 00:11:31.868
So we've gone from $25 to $30
million to $1,000 in the space of
206
00:11:31.868 --> 00:11:33.211
about a dozen years.
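To get a feel for how steep that drop is, here is a small worked calculation. It uses the figures quoted above and assumes "about a dozen years" means 12 years and that the cost fell at a steady exponential rate, which is a simplification; the function name `halving_time_years` is illustrative.

```python
import math

def halving_time_years(start_cost: float, end_cost: float, years: float) -> float:
    """Years per halving, assuming a steady exponential decline in cost."""
    halvings = math.log2(start_cost / end_cost)
    return years / halvings

END_COST = 1_000
YEARS = 12  # assumed reading of "about a dozen years"

for start_cost in (25_000_000, 30_000_000):
    fold = start_cost / END_COST
    t_half = halving_time_years(start_cost, END_COST, YEARS)
    print(f"${start_cost:,} -> ${END_COST:,}: {fold:,.0f}-fold drop, "
          f"halving about every {t_half:.2f} years")
```

Under these assumptions the cost falls by a factor of 25,000 to 30,000, which works out to halving in less than a year on average.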
207
00:11:33.211 --> 00:11:37.343
And that opens up a world of experiments
that we didn't think were feasible before,
208
00:11:37.343 --> 00:11:40.617
not only because of the time involved but
also because of the cost.
209
00:11:40.617 --> 00:11:42.626
So finally, where is all this data?
210
00:11:42.626 --> 00:11:47.229
So there are now trillions of bases of
data that have already been generated.
211
00:11:48.240 --> 00:11:50.870
You and I can go and
download this data and study it ourselves.
212
00:11:50.870 --> 00:11:55.500
Even though this data has been published
and deposited in public archives,
213
00:11:55.500 --> 00:11:58.230
that doesn't mean that there's
nothing more to learn from it.
214
00:11:58.230 --> 00:12:02.190
The convention in the field is that
once you publish a paper describing
215
00:12:02.190 --> 00:12:04.660
some genomic data set,
you're required to release it, and
216
00:12:04.660 --> 00:12:06.710
generally release it with no restrictions.
217
00:12:06.710 --> 00:12:09.900
So there's a terrific set of
repositories of all this data.
218
00:12:09.900 --> 00:12:15.070
The biggest one is the National Center for
Biotechnology Information or NCBI.
219
00:12:15.070 --> 00:12:18.930
The raw data is deposited there in
something called the Sequence Read Archive
220
00:12:18.930 --> 00:12:20.150
or SRA.
221
00:12:20.150 --> 00:12:23.540
But many more databases are contained
within NCBI that contain, for
222
00:12:23.540 --> 00:12:26.120
example, the names and locations of all
223
00:12:26.120 --> 00:12:29.630
the genes that are present in all
the genomes that we've been sequencing.
224
00:12:29.630 --> 00:12:31.730
So this is a great resource for
people who want to go and
225
00:12:31.730 --> 00:12:34.650
try to make new discoveries,
not only about the human genome, but
226
00:12:34.650 --> 00:12:37.860
about the many other thousands of species
that we're engaged in sequencing.