
03/10/2011 CS 555 - Biological and Linguistic Sequence Analysis 1

MapReduce applications in Next Generation Sequencing

Sequencing Applications

Short read mapping

CloudBurst (Schatz 2009)

Crossbow (Langmead et al. 2009)

A few slides stolen from presentations by Michael Schatz
5/3/2011 CS 506 Problem Solving with Large Clusters
DNA/RNA Sequencing

Given a sample of DNA or RNA, read the sequence of bases that make it up

Used to be slow & manual - Sanger sequencing

Current generation of massively parallel machines enables high throughput sequencing of entire genomes ("next generation")
Sequencing Technology
Sequencing Costs
Reference Genomes

Lots of effort has gone into publishing "reference" genomes

e.g. Human in 2003

Combine contributions from many individuals to describe "typical" genome sequence for species

Thousands of reference genomes have now been published for various organisms

I'll be talking about applications that use the human reference

3.2 billion bases, keeps getting refined, considered 'complete enough'
Sequencing Data

Current sequencing machines produce lots of data

e.g. Illumina HiSeq - up to 25GB/day

But there's a catch: short reads

Machines output short sequences (25 - 100bp) in batches of 10s or 100s of millions

Bioinformatics challenge is to make sense of all of these short reads

Many applications are based on finding mapping locations of sample reads in the reference genome

Can be solved by lots of fast approximate matching/local alignment
Example Read

Typical sequence pair in FASTQ format:
@GA1_0010:3:120:19851:8281#0/1
NTGTGGTTTATCTATCCACCAGGACAGATTTCTACA
+GA1_0010:3:120:19851:8281#0/1
KSSPPUXXVU[X]^^_^X[]Z__^_X^_____]\^`
@GA1_0010:3:120:19851:8281#0/2
ACCAATTTACTGTATTAGTCCATCTTAATAAGAAAT
+GA1_0010:3:120:19851:8281#0/2
ddcdcacddc`dcdd`d`adddddYdddadddddcc
(Figure labels on the records above: Sequence Name, # in pair, DNA Sequence, Quality Scores)

Reads can be generated in pairs that come from a known distance apart in the sample

Can be a mapping hint, among other uses

Quality scores indicate the probability that the base in that position was called in error
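To make the quality line concrete, here is a minimal sketch that parses the first record above and turns each quality character into an error probability. It assumes the Phred+64 encoding used by Illumina pipelines of that period (the slide does not state the offset), and the names `FastqRecord`, `parse_fastq` and `error_probabilities` are made up for this illustration.

```python
from typing import Iterator, NamedTuple

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str

def parse_fastq(lines: list[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as 4 lines per read."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (s.strip() for s in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield FastqRecord(header[1:], seq, qual)

def error_probabilities(quality: str, offset: int = 64) -> list[float]:
    # Assumed encoding: Phred score Q = ord(char) - offset, P(error) = 10 ** (-Q / 10).
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality]

record = next(parse_fastq([
    "@GA1_0010:3:120:19851:8281#0/1",
    "NTGTGGTTTATCTATCCACCAGGACAGATTTCTACA",
    "+GA1_0010:3:120:19851:8281#0/1",
    "KSSPPUXXVU[X]^^_^X[]Z__^_X^_____]\\^`",
]))
probs = error_probabilities(record.quality)
```

Under that assumption the leading `K` decodes to Q = 11, which fits the uncalled `N` at the start of the read; later characters decode to higher scores and hence lower error probabilities.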
Sequencing Applications

By finding approx. match locations of reads from a sample in the reference, you can characterize the genetic variations in the sample compared to the reference

For example, single nucleotide polymorphism (SNP) discovery - single bases where the sample differs from the reference
         CAAATAGGC
          AAATAGGCT
        GCAAATCGG
           AATAGGCTT
            ATAGGCTTA
...ACGTAGCAAATCGGCTTACTAGACCAATTTAC...  Reference
              ^ SNP
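The pileup on this slide can be reproduced with a small sketch. It places each read at its best Hamming-distance offset by brute force (a stand-in for a real aligner), tallies the bases seen at each reference position, and reports positions where the majority base differs from the reference. The function names and the `min_support` threshold are assumptions for illustration only.

```python
from collections import Counter

def best_offset(reference: str, read: str) -> int:
    """Offset with the fewest mismatches (brute-force Hamming scan)."""
    def mismatches(off: int) -> int:
        return sum(a != b for a, b in zip(read, reference[off:off + len(read)]))
    return min(range(len(reference) - len(read) + 1), key=mismatches)

def call_snps(reference: str, reads: list[str], min_support: int = 2) -> dict:
    pileup = {}  # reference position -> Counter of bases observed there
    for read in reads:
        off = best_offset(reference, read)
        for i, base in enumerate(read):
            pileup.setdefault(off + i, Counter())[base] += 1
    snps = {}
    for pos, counts in sorted(pileup.items()):
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support >= min_support:
            snps[pos] = (reference[pos], base, support)
    return snps

reference = "ACGTAGCAAATCGGCTTACTAGACCAATTTAC"
reads = ["CAAATAGGC", "AAATAGGCT", "GCAAATCGG", "AATAGGCTT", "ATAGGCTTA"]
snps = call_snps(reference, reads)
```

Four of the five reads carry an A where the reference has a C; the fifth agrees with the reference, which is the sort of disagreement quality scores help to weigh.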
Sequencing Applications

Some experiments can generate reads in pairs that have a known distance apart in the sample genome

Distance can be used to direct matching

Mappings of pairs that aren't concordant with the expected distance can indicate larger variations like deletions, insertions, and inversions

Medvedev et al 2009, Nature Methods
Sequencing Applications

RNA-SEQ:

Sequence RNA

Map reads back to reference

Use number of reads to estimate gene expression

Find which exons are making it into the final gene product ("alternative splicing")

Illumina.com
Sequencing Applications

Chromatin Immunoprecipitation (ChIP-SEQ):

Transcription Factors: proteins that bind to DNA to turn genes on or off

ChIP-SEQ captures reads near TFs

Map back to reference, learn TF binding sites, figure out how genes regulate each other

Valouev et al, Nature Methods, 2008.
Sequencing Applications

Bisulfite Sequencing

In cells, methyl groups are added to cytosine (C) bases in the DNA strand, especially 'CpG' sites; this usually causes gene repression

Example of an 'epigenetic' mechanism: heritable change to DNA without alterations in DNA sequence

Bisulfite turns unmethylated C's into U's but leaves methylated C's alone; the U's then show up as T's in reads

Mapping to converted versions of the reference shows where methylation is happening
m m
GCCCGTCACACG
CGGGCAGTGTGC
m m
GTTCGTTATACG
TGGGCAGTGTGC
GTTCGTTATACG
CAAGCAATATGC
TGGGCAGTGTGC
ACCCGTCACACG
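A minimal sketch of the idea, for one strand only. The methylated positions 3 and 10 are inferred from the slide's example strand (the two C's that survive conversion), and the function names are hypothetical. The read is mapped against a fully C-to-T converted copy of the reference, then compared with the unconverted reference to see which C's were protected.

```python
def bisulfite_convert(seq: str, methylated: set[int]) -> str:
    """Simulate bisulfite treatment: every unmethylated C is read as T."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )

def call_methylation(reference: str, read: str, offset: int) -> list[int]:
    """Reference positions where a C survived conversion in the read."""
    return [offset + i for i, base in enumerate(read)
            if base == "C" and reference[offset + i] == "C"]

reference = "GCCCGTCACACG"
read = bisulfite_convert(reference, methylated={3, 10})

# Map in "converted space" so that converted reads still match exactly.
converted_ref = reference.replace("C", "T")
offset = converted_ref.find(read.replace("C", "T"))
sites = call_methylation(reference, read, offset)
```

A real pipeline would also handle the opposite strand, where the conversion appears as G to A, and would use an approximate aligner rather than `str.find`.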
Short Read Mapping Challenges

All of these applications boil down to efficient approximate matching of large numbers of short sequences.

The human genome is made of ~50% repetitive sequences - causes mapping ambiguity.

DNA is double stranded, so reads could come from either strand: need to map reverse complement as well

ACTGACCGT ↔ ACGGTCAGT

Have to account for both sequencing (machine) errors and variations between the sample and reference

Two main strategies used by non-distributed short read aligners: k-mer Hashing and Suffix Tree/Array methods
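The reverse complement used on this slide takes only a few lines; this sketch assumes upper-case reads over A, C, G, T plus N for uncalled bases.

```python
# Translation table: each base maps to its complement; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse to read the opposite strand 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

example = reverse_complement("ACTGACCGT")
```

Applying the function twice returns the original read, which makes a convenient sanity check.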
Seed-and-Extend

RMAP aligner (Smith et al 2008) based on the Baeza-Yates and Perleberg algorithm for approximate matching

Goal is to find a mapping within the reference for a read of length n with up to k mismatches (or insertions / deletions).

With bound on mismatches of k, if we partition the read into k + 1 contiguous seeds, at least one will match the reference exactly

Use an index like a hash table to match to a candidate location in the reference, and then try to align the whole read there with a scan for mismatches or dynamic programming in the surrounding region
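A small sketch of seed-and-extend for mismatches only (no insertions or deletions). It is not RMAP's implementation; `build_index` and `seed_and_extend` are illustrative names, and the index is rebuilt on every call for brevity. The pigeonhole argument is what makes it complete: k mismatches cannot touch all k + 1 disjoint seeds.

```python
from collections import defaultdict

def build_index(reference: str, seed_len: int) -> dict[str, list[int]]:
    """Hash table from every seed_len-mer to its positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index

def seed_and_extend(reference: str, read: str, k: int) -> list[tuple[int, int]]:
    """Return (position, mismatches) for every placement with at most k mismatches."""
    seed_len = len(read) // (k + 1)          # requires len(read) >= k + 1
    index = build_index(reference, seed_len)
    hits = {}
    for s in range(k + 1):                   # k + 1 non-overlapping seeds
        seed_start = s * seed_len
        seed = read[seed_start:seed_start + seed_len]
        for ref_pos in index.get(seed, []):
            start = ref_pos - seed_start     # implied start of the whole read
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= k:
                hits[start] = mismatches
    return sorted(hits.items())

reference = "ACGTAGCAAATCGGCTTACTAGACCAATTTAC"
one_mismatch = seed_and_extend(reference, "CAAATAGGC", k=1)
exact_only = seed_and_extend(reference, "CAAATAGGC", k=0)
```

With k = 1 the read is found at position 6 through its first seed `CAAA`; with k = 0 the single seed is the whole read and nothing matches.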
Hadoop & MapReduce

CloudBurst Architecture

1. Map: Catalog K-mers
Emit every k-mer in the genome and non-overlapping k-mers in the reads
Simultaneously index the genome and join with the reads

2. Shuffle: Coalesce Seeds
Hadoop internal shuffle groups together k-mers shared by the reads and the reference
Conceptually build a hash table of k-mers and their occurrences

3. Reduce: End-to-end alignment
Locally extend alignment beyond seeds by counting mismatches, or with Landau-Vishkin k-difference algorithm to allow for indels.
If read aligns end-to-end, record the alignment

(Figure: Human chromosome 1, Read 1 and Read 2 pass through map, shuffle and reduce, producing:)
Read 1, Chromosome 1, 12345-12365
Read 2, Chromosome 1, 12350-12370
CloudBurst Algorithm

CloudBurst Map

Class Mapper
    function map(seq)
        if is_read(seq):
            forward_seeds = k + 1 contiguous substrings of seq
            for each seed in forward_seeds:
                emit(seed, (read_id, seed_position, is_ref=0, is_RC=0, left_flank, right_flank))
            reverse_seeds = k + 1 contiguous substrings of reverse_complement(seq)
            for each seed in reverse_seeds:
                emit(seed, (read_id, seed_position, is_ref=0, is_RC=1, left_flank, right_flank))
        else if is_ref(seq):
            seeds = all substrings of length seed_length in seq
            for each seed in seeds:
                emit(seed, (chromosome_id, seed_position, is_ref=1, is_RC=0, left_flank, right_flank))
CloudBurst Map Redundancy

Some seeds in the reference and reads are far more common than others

Especially duplicated bases: AAAAA....

Processing every possible mapping for these seeds would be done by one reducer

CloudBurst gets around this by:

Emitting redundant copies of duplicated base seeds from the reference: AAAA-1, AAAA-2, AAAA-3, ...

For duplicated base seeds from the reads, emit with a random suffix: AAAA-2

That way these reads are processed across several reducers
CloudBurst Reduce

Class Reducer
    function configure()
        ref_seeds = new List()
    function reduce(seed, (read_id, seed_position, is_ref, is_RC, left_flank, right_flank))
        if (is_ref):
            ref_seeds.add(seed, ref_values)
        else:
            for ref_seed in ref_seeds:
                scan left_flank and right_flank from current and ref_values for mismatches
                if mismatches <= k:
                    emit(read_id, ref_values.id, position, strand, # mismatches)

MapReduce sort and shuffle ensures that all sequences that contain a seed are processed by the same reducer

Use an intermediate sorter to send reference sequences to the reducer before read sequences by sorting on is_ref
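The map, shuffle and reduce steps can be imitated in a single process. The sketch below is not CloudBurst's code and uses no Hadoop API: a dictionary stands in for the shuffle, reads are assumed to be of equal length, only mismatches are counted, and the redundant-seed trick is left out. All function names are made up for this example.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def map_read(read_id, read, k):
    """Emit k + 1 non-overlapping seeds of the read and of its reverse complement."""
    seed_len = len(read) // (k + 1)
    for is_rc, seq in ((0, read), (1, read.translate(COMPLEMENT)[::-1])):
        for s in range(k + 1):
            pos = s * seed_len
            seed = seq[pos:pos + seed_len]
            # value layout: (is_ref, id, seed_position, is_RC, left_flank, right_flank)
            yield seed, (0, read_id, pos, is_rc, seq[:pos], seq[pos + seed_len:])

def map_reference(chrom_id, chrom, seed_len, flank_len):
    """Emit every seed_len-mer of the reference with flanks long enough to check a read."""
    for pos in range(len(chrom) - seed_len + 1):
        left = chrom[max(0, pos - flank_len):pos]
        right = chrom[pos + seed_len:pos + seed_len + flank_len]
        yield chrom[pos:pos + seed_len], (1, chrom_id, pos, 0, left, right)

def reduce_seed(values, k):
    """For one shared seed, scan the flanks of each read against each reference hit."""
    refs = [v for v in values if v[0] == 1]
    reads = [v for v in values if v[0] == 0]
    for _, read_id, rpos, is_rc, rleft, rright in reads:
        for _, chrom_id, cpos, _, cleft, cright in refs:
            if len(cleft) < len(rleft) or len(cright) < len(rright):
                continue  # the read would hang off the end of the reference
            # Left flanks are compared right-aligned, right flanks left-aligned.
            mismatches = sum(a != b for a, b in zip(rleft[::-1], cleft[::-1]))
            mismatches += sum(a != b for a, b in zip(rright, cright))
            if mismatches <= k:
                yield read_id, chrom_id, cpos - rpos, "-" if is_rc else "+", mismatches

def cloudburst(reference, reads, k):
    read_len = len(next(iter(reads.values())))  # assumes equal-length reads
    seed_len = read_len // (k + 1)
    groups = defaultdict(list)                  # stands in for Hadoop's shuffle
    for seed, value in map_reference("chr1", reference, seed_len, read_len):
        groups[seed].append(value)
    for read_id, read in reads.items():
        for seed, value in map_read(read_id, read, k):
            groups[seed].append(value)
    hits = {hit for values in groups.values() for hit in reduce_seed(values, k)}
    return sorted(hits)

reference = "ACGTAGCAAATCGGCTTACTAGACCAATTTAC"
reads = {"r1": "CAAATAGGC", "r2": "CCGATTTGC"}
hits = cloudburst(reference, reads, k=1)
```

Read `r2` is the reverse complement of a reference substring, so it is reported on the "-" strand. Both of its seeds lead to the same placement, which is why the hits are collected in a set.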
CloudBurst Runtimes
CloudBurst vs. RMAP
CloudBurst vs. Crossbow

CloudBurst paper got people talking about MapReduce in bioinformatics, but more of a proof-of-concept than production-ready aligner

Missing some features like quality scores and paired-end mapping

In practice, emitting all possible mappings is extremely disk-intensive

Usually only need the best mapping

A more typical use of MapReduce is Crossbow, which parallelizes Bowtie, an aligner based on BWT/FM-index

The BWT/FM-index is a compressed suffix array representation of the reference sequence.

Human reference can be stored in ~3GB, and therefore can be made available to mappers in the Hadoop distributed cache
Crossbow Results
Crossbow on Amazon EC2
Short read alignment - Suffix arrays

Suffix tree/array based methods are very fast for exact match, as we've seen

Finds multiple match locations at the same time

However, they have large memory requirements

Assume 4 byte integers, then a suffix array needs 4N bytes ≈ 12GB for the human genome

Implementation tricks and compression can get this down, but not by much

To get around this, the current crop of aligners use the FM-Index, based on the Burrows-Wheeler Transform (BWT)
Burrows-Wheeler Transform

BWT (Burrows and Wheeler, 1994) is a transformation on a string X related to building a suffix array. The easy way to build it:

Add an end-of-string symbol $ that is lexicographically smaller than all characters in X to the end of X

Create a matrix of all cyclic rotations of X and sort it

BWT(X) is the last column of the sorted matrix

The BWT was originally designed for compression purposes
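The "easy way" translates almost directly into code. This sketch materialises every rotation, so it is only suitable for toy inputs; it relies on `$` sorting before the letters in ASCII, and the function name is an illustrative choice.

```python
def bwt_by_rotations(x: str, terminator: str = "$") -> str:
    """BWT via the sorted matrix of cyclic rotations (quadratic memory)."""
    text = x + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)  # last column of the matrix

transformed = bwt_by_rotations("ACAACG")
```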
Computing BWT

X: ACAACG

Append $:
ACAACG$
0123456

Create Cycled Strings:
0 ACAACG$
1 CAACG$A
2 AACG$AC
3 ACG$ACA
4 CG$ACAA
5 G$ACAAC
6 $ACAACG

Sort Rows:
6 $ACAACG
2 AACG$AC
0 ACAACG$
3 ACG$ACA
1 CAACG$A
4 CG$ACAA
5 G$ACAAC

Take Last Column:
Suffix array: [6,2,0,3,1,4,5]
BWT(X): GC$AAAC
Relation of BWT and Suffix Arrays

For our string ACAACG, the suffix array S is [6,2,0,3,1,4,5], and BWT(X) is GC$AAAC

BWT(X) and S are related:

BWT(i) = $ if S[i] == 0

BWT(i) = X[S[i] - 1] otherwise

Therefore, we can use linear time and space algorithms for building suffix arrays to compute BWT(X)
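The relation between S and BWT(X) can be checked with a short sketch. The suffix array here is built naively by sorting suffixes, which is not one of the linear-time algorithms the slide refers to; only the final step, reading the BWT off the suffix array, is the point. Function names are illustrative.

```python
def suffix_array(text: str) -> list[int]:
    # Naive construction by sorting suffixes; fine for short strings only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_suffix_array(x: str, terminator: str = "$") -> str:
    text = x + terminator
    sa = suffix_array(text)
    # BWT(i) is the character just before suffix S[i], or the terminator when S[i] == 0.
    return "".join(terminator if s == 0 else text[s - 1] for s in sa)

sa = suffix_array("ACAACG$")
bwt = bwt_from_suffix_array("ACAACG")
```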
Aside on BWT Compression

Why is BWT useful for compression?

The transform attempts to take advantage of patterns in the text to produce a string with more repeated characters

Imagine the BWT of an English text, which contains many instances of 'the'

The sorted matrix will have many rows that start with 'he' and end with 't', which means the last column will have a large run of 't's

This makes it easy to compress

There are fast algorithms for reversing the transform and searching in it without decompressing
BWT properties

The matrix of sorted cyclic rotations has the property of 'last-first mapping'

The ith occurrence of char a in the last column is the same character in X as the ith occurrence of a in the first column

We can use this property to reverse the transformation

Langmead et al, Genome Biology, 2009.
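A sketch of reversing the transform with the last-first mapping. It assumes the terminator sorts before every other character, so row 0 of the sorted matrix starts with `$`; the walk then recovers X from its last character back to its first. The function name is an illustrative choice.

```python
def inverse_bwt(bwt: str, terminator: str = "$") -> str:
    # first_start[c]: row where the block of character c begins in the first column.
    first_start, total = {}, 0
    for c in sorted(set(bwt)):
        first_start[c] = total
        total += bwt.count(c)
    # lf[i]: row whose first-column character is the same text character as bwt[i].
    seen, lf = {}, []
    for c in bwt:
        lf.append(first_start[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    # Row 0 begins with the terminator, so its last character is the last character of X.
    row, out = 0, []
    for _ in range(len(bwt) - 1):
        out.append(bwt[row])
        row = lf[row]
    return "".join(reversed(out))

original = inverse_bwt("GC$AAAC")
```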
The FM Index

Ferragina and Manzini (2000) developed methods for searching BWT transformed strings with minimal memory requirements

First we need to define two sets of values:

Let C(a) be the count of symbols in X[0,n-2] that are lexicographically smaller than a, where n is the length of X

Let O(a,i) be the number of occurrences of a in BWT(X)[0,i]

Using these values, we can quickly search for substrings using ranges on the BWT string
Searching with the FM Index

Searching for the pattern "AAC" in our string ACAACG

Let c be the current character, sp be the start of our range on BWT and ep be the end

Start with c equal to the last character in the pattern

Initialize:
c = 'C'
sp = C[c] + 1 = 4
ep = C[c+1] + 1 = C['G'] + 1 = 6

C[$] = 0  O($,1)=0  O($,2)=0  O($,3)=0  O($,4)=0  O($,5)=0  O($,6)=0
C[A] = 0  O(A,1)=0  O(A,2)=0  O(A,3)=0  O(A,4)=1  O(A,5)=2  O(A,6)=3
C[C] = 3  O(C,1)=0  O(C,2)=1  O(C,3)=1  O(C,4)=1  O(C,5)=1  O(C,6)=1
C[G] = 5  O(G,1)=1  O(G,2)=1  O(G,3)=1  O(G,4)=1  O(G,5)=1  O(G,6)=1
Searching with the FM Index

Searching for the pattern "AAC" in our string ACAACG

Move to next character
c = 'A'
sp = C[c] + O(c,sp) + 1 = 2
ep = C[c] + O(c,ep) + 1 = 4

C[$] = 0  O($,1)=0  O($,2)=0  O($,3)=0  O($,4)=0  O($,5)=0  O($,6)=0
C[A] = 0  O(A,1)=0  O(A,2)=0  O(A,3)=0  O(A,4)=1  O(A,5)=2  O(A,6)=3
C[C] = 3  O(C,1)=0  O(C,2)=1  O(C,3)=1  O(C,4)=1  O(C,5)=1  O(C,6)=1
C[G] = 5  O(G,1)=1  O(G,2)=1  O(G,3)=1  O(G,4)=1  O(G,5)=1  O(G,6)=1
Searching with the FM Index

Searching for the pattern "AAC" in our string ACAACG

Move to next character
c = 'A'
sp = C[c] + O(c,sp) + 1 = 1
ep = C[c] + O(c,ep) + 1 = 2
Stop when pattern exhausted or sp >= ep

C[$] = 0  O($,1)=0  O($,2)=0  O($,3)=0  O($,4)=0  O($,5)=0  O($,6)=0
C[A] = 0  O(A,1)=0  O(A,2)=0  O(A,3)=0  O(A,4)=1  O(A,5)=2  O(A,6)=3
C[C] = 3  O(C,1)=0  O(C,2)=1  O(C,3)=1  O(C,4)=1  O(C,5)=1  O(C,6)=1
C[G] = 5  O(G,1)=1  O(G,2)=1  O(G,3)=1  O(G,4)=1  O(G,5)=1  O(G,6)=1
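The three steps above can be put together as a backward search. This is a sketch under a slightly different bookkeeping convention than the slides: here C[a] counts the terminator as well, and `occ[a][i]` counts occurrences in the first i characters of the BWT, so the "+ 1" terms disappear. The row ranges come out as the same numbers (4,6 then 2,4 then 1,2), read as half-open ranges over 0-based rows. The class name and the full, unsampled `occ` table are choices made for this illustration.

```python
class FMIndex:
    def __init__(self, x: str, terminator: str = "$"):
        text = x + terminator
        # Naive suffix array; a real index would use a linear-time construction.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])
        # text[s - 1] with s == 0 wraps around to the terminator, as required.
        self.bwt = "".join(text[s - 1] for s in self.sa)
        # C[a]: number of symbols in the text (terminator included) that sort before a.
        self.C, total = {}, 0
        for a in sorted(set(text)):
            self.C[a] = total
            total += text.count(a)
        # occ[a][i]: occurrences of a in bwt[0:i], so occ[a][0] == 0.
        self.occ = {a: [0] for a in self.C}
        for ch in self.bwt:
            for a, counts in self.occ.items():
                counts.append(counts[-1] + (ch == a))

    def search(self, pattern: str) -> list[int]:
        """Text positions where pattern occurs exactly."""
        sp, ep = 0, len(self.bwt)          # half-open range of matrix rows
        for c in reversed(pattern):        # consume the pattern right to left
            if c not in self.C:
                return []
            sp = self.C[c] + self.occ[c][sp]
            ep = self.C[c] + self.occ[c][ep]
            if sp >= ep:
                return []
        return sorted(self.sa[sp:ep])

index = FMIndex("ACAACG")
positions = index.search("AAC")
```

Each pattern character costs two table lookups, independent of the length of the reference.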
Advantages of BWT/FM

Backwards searching is equivalent to traversing a suffix tree. But why go to this trouble?

Our goal was to save memory, but the O(a,i) array would be very large: 4n integers

The trick is to only store a few entries from the O array and compute the rest on the fly from the BWT string

For example BWA (Li and Durbin, 2009) only stores O(·,i) for i's that are multiples of 128

A DNA string, or its BWT, can be stored using only two bits per character

This leads to massive memory savings
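A sketch of the checkpoint trick: counts are stored only every `step` positions and the remainder is counted directly from the BWT string. The class name and layout are assumptions for illustration, not BWA's actual data structure, and the BWT is kept as a plain Python string rather than packed into two bits per base.

```python
class SampledOcc:
    """Occurrence counts stored only every `step` positions of a non-empty BWT."""

    def __init__(self, bwt: str, step: int = 128):
        self.bwt, self.step = bwt, step
        running = dict.fromkeys(sorted(set(bwt)), 0)
        self.checkpoints = []  # checkpoints[j][a] = occurrences of a in bwt[0:j*step]
        for i, ch in enumerate(bwt):
            if i % step == 0:
                self.checkpoints.append(dict(running))
            running[ch] += 1

    def occ(self, a: str, i: int) -> int:
        """Occurrences of a in bwt[0:i]: nearest checkpoint plus a scan of < 2*step chars."""
        j = min(i // self.step, len(self.checkpoints) - 1)
        base = self.checkpoints[j].get(a, 0)
        return base + self.bwt.count(a, j * self.step, i)

sampled = SampledOcc("GC$AAAC", step=2)
count_a = sampled.occ("A", 5)
```

The memory for the table shrinks by roughly a factor of `step`, at the cost of a short scan on every lookup.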


Inexact tree matching

Can use suffix tree to align portion of the read and extend (like BYP)

Some use greedy backtracking approaches that do not guarantee that all matches will be found

Others do all possible backtrackings

The dotted line shows a path through the prefix tree of "GOOGOL" looking for the pattern "LOL" with 1 mismatch

Li and Durbin, Bioinformatics, 2009
Inexact tree matching

BWA (Li and Durbin, 2009) computes a lower bound on the number of possible mismatches for substrings of the pattern

For the previous example, "LOL" and "GOOGOL":

D(0) = 0

D(1) = 1 ("LO" is not a substring)

D(2) = 1

Therefore would not traverse the "G" and "O" branches
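The D values on this slide can be reproduced with a greedy scan. In this sketch Python's substring test stands in for the index lookup that BWA performs, and the function name is hypothetical: whenever the current stretch of the pattern stops occurring exactly in the reference, at least one more difference is unavoidable, and a new stretch begins.

```python
def lower_bound_mismatches(pattern: str, reference: str) -> list[int]:
    """d[i] is a lower bound on the differences needed to align pattern[0..i]."""
    d, z, start = [], 0, 0
    for i in range(len(pattern)):
        if pattern[start:i + 1] not in reference:
            z += 1          # this stretch cannot occur exactly: one more difference
            start = i + 1   # begin a fresh stretch after position i
        d.append(z)
    return d

d = lower_bound_mismatches("LOL", "GOOGOL")
```

During backtracking, a branch can be dropped as soon as the remaining mismatch budget falls below the bound for the part of the pattern still to be matched.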
Additional Complications

Incorporating base quality scores

Bowtie (Langmead et al 2009) chooses backtracking paths based on quality scores

Others re-rank alignments based on the quality scores of their matches and mismatches

Both the reads and their reverse complements need to be mapped to the reference

Need to either hold two data structures in memory or convert on the fly
Current State of Short Read Aligners

Hash-Index based aligners are the most sensitive

Important for applications where mismatches cause big problems like structural variation detection

Can be very slow and use lots of memory

Suffix tree based aligners are gaining in popularity

Much faster

Dramatically less memory usage (3GB vs. 8-12GB)

One recent strategy is to use a tiered alignment strategy, with a tree based aligner first and an indexed aligner on the reads it fails to map unambiguously
de novo Short Read Assembly

What if there's no genome reference to align to?

Sequencing new species from scratch

Metagenomic studies - take a sample (of water, soil, etc) and sequence all the DNA you find in it

Characterizing areas that are heavily divergent from the reference

Can't cast these problems as mapping/alignment anymore

Instead, we want to piece the short reads back together to find the long sequences from which they came

Look for overlaps between reads


De Bruijn Graphs

Most de novo assemblers use a data structure called a de Bruijn graph to hold overlaps between reads

Each node represents a set of k-mers from the reads that overlap by k - 1 symbols

Zerbino and Birney, Genome Res, 2008
Assembly with de Bruijn Graphs

Use a set of hash table indices to add each read to the graph

Collapse non-branching chains of nodes into single nodes to produce 'contigs'

Try to use paired end reads, which come from a known distance apart, to scaffold contigs together

This process is extremely RAM intensive - often conducted on 512GB supercomputers

Active research into parallelizing and distributing the process
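A toy version of the first two steps, using the common textbook variant in which nodes are (k-1)-mers and each distinct k-mer is an edge (the Velvet-style graph cited on the previous slide groups k-mers into nodes differently). Error correction, reverse complements, scaffolding and isolated cycles are all ignored, and the function names are illustrative.

```python
from collections import defaultdict

def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Edges between (k-1)-mers, one edge per distinct k-mer in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def contigs(graph: dict[str, set[str]]) -> list[str]:
    """Collapse non-branching chains of nodes into contig strings."""
    indeg = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1
    nodes = set(graph) | set(indeg)

    def is_inner(n: str) -> bool:   # exactly one way in and one way out
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    result = []
    for node in sorted(nodes):
        if is_inner(node):
            continue                # contigs start only at branch points or ends
        for succ in sorted(graph.get(node, ())):
            path = node + succ[-1]
            while is_inner(succ):
                succ = next(iter(graph[succ]))
                path += succ[-1]
            result.append(path)
    return result

reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
assembled = contigs(de_bruijn(reads, k=4))
```

The three overlapping reads collapse into one contig because every interior node has a single predecessor and a single successor. A repeated (k-1)-mer would introduce a branch and split the result into several contigs.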
