Mapreduce DNA Sequencing

03/10/2011 CS 555 - Biological and Linguistic Sequence Analysis 1
MapReduce applications in Next Generation Sequencing
Sequencing Applications
Short read mapping
CloudBurst (Schatz 2009)
Crossbow (Langmead et al. 2009)
A few slides stolen from presentations by Michael Schatz
5/3/2011 CS 506 Problem Solving with Large Clusters
DNA/RNA Sequencing
Given a sample of DNA or RNA, read the sequence of bases that make it up
Used to be slow & manual - Sanger sequencing
Current generation of massively parallel machines enables high throughput sequencing of entire genomes ("next generation")
Sequencing Technology
Sequencing Costs
Reference Genomes
Lots of effort has gone into publishing "reference" genomes
e.g. Human in 2003
Combine contributions from many individuals to describe "typical" genome sequence for species
Thousands of reference genomes have now been published for various organisms
I'll be talking about applications that use the human reference
3.2 billion bases, keeps getting refined, considered 'complete enough'
Sequencing Data
Current sequencing machines produce lots of data
e.g. Illumina HiSeq - up to 25GB/day
But there's a catch: short reads
Machines output short sequences (25 - 100bp) in batches of 10s or 100s of millions
Bioinformatics challenge is to make sense of all of these short reads
Many applications are based on finding mapping locations of sample reads in the reference genome
Can be solved by lots of fast approximate matching/local alignment
Example Read
Typical sequence pair in FASTQ format:
@GA1_0010:3:120:19851:8281#0/1
NTGTGGTTTATCTATCCACCAGGACAGATTTCTACA
+GA1_0010:3:120:19851:8281#0/1
KSSPPUXXVU[X]^^_^X[]Z__^_X^_____]\^`
@GA1_0010:3:120:19851:8281#0/2
ACCAATTTACTGTATTAGTCCATCTTAATAAGAAAT
+GA1_0010:3:120:19851:8281#0/2
ddcdcacddc`dcdd`d`adddddYdddadddddcc
Sequence Name
# in pair
DNA Sequence
Quality Scores
Reads can be generated in pairs that come from a known distance apart in the sample
Can be a mapping hint, among other uses
Quality scores indicate the probability that the base in that position was called by error
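The four-line record layout and the quality encoding can be sketched in a few lines of Python. This is only an illustration: `parse_fastq` and `error_probabilities` are names invented here, and the Phred offset of 64 is an assumption (some Illumina pipelines of that era used it, while Sanger-style files use 33), so check which encoding a given file uses.

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, optionally repeating the name
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], seq, qual

def error_probabilities(qual, offset=64):
    """Map each quality character to p = 10^(-Q/10), with Q = ord(char) - offset.

    offset=64 is an assumption about the encoding, not something the file states.
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in qual]

# First read of the pair shown above (the backslash is escaped for Python).
record = (
    "@GA1_0010:3:120:19851:8281#0/1\n"
    "NTGTGGTTTATCTATCCACCAGGACAGATTTCTACA\n"
    "+GA1_0010:3:120:19851:8281#0/1\n"
    "KSSPPUXXVU[X]^^_^X[]Z__^_X^_____]\\^`\n"
)

for name, seq, qual in parse_fastq(io.StringIO(record)):
    probs = error_probabilities(qual)
    print(name, len(seq), "bases, worst error probability", max(probs))
```

The `/1` and `/2` suffixes on the names are what tie the two reads of a pair together.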
By finding approx. match locations of reads from a sample in the reference, you can characterize the genetic variations in the sample compared to the reference
For example, single nucleotide polymorphism (SNP) discovery - single bases where the sample differs from the reference
         CAAATAGGC
          AAATAGGCT
        GCAAATCGG
           AATAGGCTT
            ATAGGCTTA
...ACGTAGCAAATCGGCTTACTAGACCAATTTAC... Reference
              ^ SNP
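A toy version of SNP discovery from mapped reads can be sketched as a pileup: count the bases that land on each reference position and report positions where the majority base disagrees with the reference. The function name `call_snps` and the thresholds `min_depth` and `min_fraction` are assumptions made for this sketch; real callers also weigh base qualities and mapping ambiguity.

```python
from collections import Counter

def call_snps(reference, alignments, min_depth=2, min_fraction=0.5):
    """alignments: list of (start, read), each read aligned gap-free at `start`."""
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little evidence at this position
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth > min_fraction:
            snps.append((pos, reference[pos], base))
    return snps

# The reference fragment and the five reads from the slide, with their offsets.
reference = "ACGTAGCAAATCGGCTTACTAGACCAATTTAC"
alignments = [
    (6, "CAAATAGGC"),
    (7, "AAATAGGCT"),
    (5, "GCAAATCGG"),
    (8, "AATAGGCTT"),
    (9, "ATAGGCTTA"),
]
print(call_snps(reference, alignments))
```

Four of the five reads carry an A where the reference has a C, so that single position is reported.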
Some experiments can generate reads in pairs that have a known distance apart in the sample genome
Distance can be used to direct matching
Mappings of pairs that aren't concordant with the expected distance can indicate larger variations like deletions, insertions, and inversions
Medvedev et al 2009, Nature Methods
RNA-SEQ:
Sequence RNA
Map reads back to reference
Use number of reads to estimate gene expression
Find which exons are making it into the final gene product ("alternative splicing")
Illumina.com
Chromatin Immunoprecipitation (ChIP-SEQ):
Transcription Factors: proteins that bind to DNA to turn genes on or off
ChIP-SEQ captures reads near TFs
Map back to reference, learn TF binding sites, figure out how genes regulate each other
Valouev et al, Nature Methods, 2008.
Bisulfite Sequencing
In cells, methyl groups are added to cytosine (C) bases in the DNA strand, especially 'CpG' sites; this usually causes gene repression
Example of an 'epigenetic' mechanism: heritable change to DNA without alterations in DNA sequence
Bisulfite turns unmethylated C's into U's but leaves methylated C's alone; U's get converted back to T's in reads
Mapping to converted versions of the reference shows where methylation is happening
   m      m
GCCCGTCACACG
CGGGCAGTGTGC
    m      m
GTTCGTTATACG
TGGGCAGTGTGC
GTTCGTTATACG
CAAGCAATATGC
TGGGCAGTGTGC
ACCCGTCACACG
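The conversion in the figure can be reproduced in silico. The sketch below uses made-up helper names (`bisulfite_convert`, `call_methylation`) and assumes the read is already aligned gap-free to the same strand at position 0; it also ignores the possibility that a C-to-T difference is a genuine SNP rather than a conversion.

```python
def bisulfite_convert(strand, methylated):
    """Unmethylated C becomes T (via U); C at an index in `methylated` stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(strand)
    )

def call_methylation(reference, read):
    """For every reference C, report True if the read still shows C (methylated)."""
    return {i: read[i] == "C" for i, base in enumerate(reference) if base == "C"}

top = "GCCCGTCACACG"
converted = bisulfite_convert(top, methylated={3, 10})
print(converted)
print(call_methylation(top, converted))
```

Only the two CpG cytosines survive the conversion, which is what lets a mapper infer where methylation occurred.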
Short Read Mapping Challenges
All of these applications boil down to efficient approximate matching of large numbers of short sequences.
The human genome is made of ~50% repetitive sequences - causes mapping ambiguity.
DNA is double stranded, so reads could come from either strand: need to map reverse complement as well
ACTGACCGT <-> ACGGTCAGT
Have to account for both sequencing (machine) errors and variations between the sample and reference
Two main strategies used by non-distributed short read aligners: k-mer Hashing and Suffix Tree/Array methods
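For reference, the reverse complement used in the example above is a two-step operation: complement each base, then reverse the string. A minimal sketch (the helper name is arbitrary, and N is mapped to itself as an assumption about how unknown bases are handled):

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Complement every base, then reverse, to read the opposite strand 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("ACTGACCGT"))
```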
Seed-and-Extend
RMAP aligner (Smith et al 2008) based on the Baeza-Yates and Perleberg algorithm for approximate matching
Goal is to find a mapping within the reference for a read of length n with up to k mismatches (or insertions / deletions).
With bound on mismatches of k, if we partition the read into k + 1 contiguous seeds, at least one will match the reference exactly
Use an index like a hash table to match to a candidate location in the reference, and then try to align the whole read there with a scan for mismatches or dynamic programming in the surrounding region
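The pigeonhole argument can be turned into a small mismatch-only aligner. This is a sketch under stated assumptions, not RMAP's implementation: the function names are invented, the index is a plain Python dictionary rebuilt on every call, and indels are not handled.

```python
from collections import defaultdict

def build_index(reference, seed_len):
    """Hash table from every seed_len-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - seed_len + 1):
        index[reference[i:i + seed_len]].append(i)
    return index

def seed_and_extend(read, reference, k):
    """Return sorted (start, mismatches) for alignments with at most k mismatches."""
    seed_len = len(read) // (k + 1)
    if seed_len == 0:
        raise ValueError("read too short for k + 1 seeds")
    index = build_index(reference, seed_len)
    hits = set()
    for s in range(k + 1):  # k + 1 non-overlapping seeds: one must match exactly
        offset = s * seed_len
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= k:
                hits.add((start, mismatches))
    return sorted(hits)

reference = "ACGTAGCAAATCGGCTTACTAGACCAATTTAC"
print(seed_and_extend("CAAATAGGC", reference, k=1))
```

With k = 1 the read is split into two seeds; the first one hits the reference exactly and the extension step counts the single mismatch in the rest of the read.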
Hadoop & MapReduce
CloudBurst Architecture
1. Map: Catalog K-mers
Emit every k-mer in the genome and non-overlapping k-mers in the reads
Simultaneously index the genome and join with the reads
2. Shuffle: Coalesce Seeds
Hadoop internal shuffle groups together k-mers shared by the reads and the reference
Conceptually build a hash table of k-mers and their occurrences
3. Reduce: End-to-end alignment
Locally extend alignment beyond seeds by counting mismatches, or with Landau-Vishkin k-difference algorithm to allow for indels.
If read aligns end-to-end, record the alignment
[Figure: Human chromosome 1, Read 1, and Read 2 flow through map, shuffle, and reduce. Example output:]
Read 1, Chromosome 1, 12345-12365
Read 2, Chromosome 1, 12350-12370
CloudBurst Algorithm
CloudBurst Map
Class Mapper
  function map(seq)
    if is_read(seq):
      forward_seeds = k + 1 contiguous substrings of seq
      for each seed in forward_seeds:
        emit(seed, (read_id, seed_position, is_ref=0, is_RC=0, left_flank, right_flank))
      reverse_seeds = k + 1 contiguous substrings of reverse_complement(seq)
      for each seed in reverse_seeds:
        emit(seed, (read_id, seed_position, is_ref=0, is_RC=1, left_flank, right_flank))
    else if is_ref(seq):
      seeds = all substrings of length seed_length in seq
      for each seed in seeds:
        emit(seed, (chromosome_id, seed_position, is_ref=1, is_RC=0, left_flank, right_flank))
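A runnable approximation of the map step, written as a plain Python generator rather than a Hadoop `Mapper`. Everything here is illustrative: `map_record`, its parameters, and the `max_flank` cap on reference flanks are assumptions made for the sketch, and all reads are assumed to share the length that `seed_len` was derived from.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def map_record(seq_id, seq, is_ref, k, seed_len, max_flank):
    """Yield (seed, value) pairs.

    value = (id, seed_position, is_ref, is_rc, left_flank, right_flank)
    """
    if is_ref:
        # Reference: every overlapping seed, with bounded flanks on both sides.
        for pos in range(len(seq) - seed_len + 1):
            left = seq[max(0, pos - max_flank):pos]
            right = seq[pos + seed_len:pos + seed_len + max_flank]
            yield seq[pos:pos + seed_len], (seq_id, pos, 1, 0, left, right)
    else:
        # Read: k + 1 non-overlapping seeds from each strand.
        for is_rc, strand in ((0, seq), (1, reverse_complement(seq))):
            for i in range(k + 1):
                pos = i * seed_len
                seed = strand[pos:pos + seed_len]
                yield seed, (seq_id, pos, 0, is_rc, strand[:pos], strand[pos + seed_len:])

for key, value in map_record("r1", "ACGTAC", is_ref=False, k=1, seed_len=3, max_flank=6):
    print(key, value)
```

In a real job the framework, not the caller, would collect these pairs and group them by seed during the shuffle.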
CloudBurst Map Redundancy
Some seeds in the reference and reads are far more common than others
Especially duplicated bases: AAAAA....
Processing every possible mapping for these seeds would be done by one reducer
CloudBurst gets around this by:
Emitting redundant copies of duplicated base seeds from the reference: AAAA-1, AAAA-2, AAAA-3, ...
For duplicated base seeds from the reads, emit with a random suffix: AAAA-2
That way these reads are processed across several reducers
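The key-suffix trick can be sketched as follows. The test for a "duplicated base" seed (all characters identical) and the `redundancy` parameter are assumptions for illustration; the criterion CloudBurst actually applies may differ.

```python
import random

def is_low_complexity(seed):
    """Assumed criterion: the seed is one base repeated, e.g. AAAA."""
    return len(set(seed)) == 1

def reference_keys(seed, redundancy):
    """Reference side: replicate a hot seed under every suffix."""
    if is_low_complexity(seed):
        return [f"{seed}-{i}" for i in range(1, redundancy + 1)]
    return [seed]

def read_key(seed, redundancy, rng=random):
    """Read side: send a hot seed to one randomly chosen replica."""
    if is_low_complexity(seed):
        return f"{seed}-{rng.randint(1, redundancy)}"
    return seed

print(reference_keys("AAAA", 3))
print(read_key("AAAA", 3))
```

Because every replica key holds a full copy of the reference entries, each read still meets all of its candidate locations, while the read-side work is spread over `redundancy` reducers.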
CloudBurst Reduce
Class Reducer
  function configure()
    ref_seeds = new List()
  function reduce(seed, (id, seed_position, is_ref, is_RC, left_flank, right_flank))
    if (is_ref):
      ref_seeds.add(seed, values)
    else:
      for ref_values in ref_seeds:
        scan left_flank and right_flank from current and ref_values for mismatches
        if mismatches <= k:
          emit(read_id, ref_values.id, position, strand, # mismatches)

MapReduce sort and shuffle ensures that all sequences that contain a seed are processed by the same reducer
Use an intermediate sorter to send reference sequences to the reducer before read sequences by sorting on is_ref
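The reduce step can be simulated locally on the values that share one seed. This sketch is an assumption-laden stand-in for the Hadoop reducer: `reduce_seed` is an invented name, the secondary sort is imitated with an in-memory `sorted` call, and only mismatches (no indels) are counted.

```python
def count_mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))

def reduce_seed(seed, values, k):
    """values: (id, seed_position, is_ref, is_rc, left_flank, right_flank) tuples
    that all share `seed`. Returns (read_id, ref_id, position, strand, mismatches)."""
    # Imitate the intermediate sorter: reference entries (is_ref=1) come first.
    ordered = sorted(values, key=lambda v: -v[2])
    refs, out = [], []
    for seq_id, pos, is_ref, is_rc, left, right in ordered:
        if is_ref:
            refs.append((seq_id, pos, left, right))
            continue
        for ref_id, ref_pos, ref_left, ref_right in refs:
            if len(ref_left) < len(left) or len(ref_right) < len(right):
                continue  # the read would run off the stored reference flanks
            mism = count_mismatches(left, ref_left[len(ref_left) - len(left):])
            mism += count_mismatches(right, ref_right[:len(right)])
            if mism <= k:
                strand = "-" if is_rc else "+"
                out.append((seq_id, ref_id, ref_pos - len(left), strand, mism))
    return out

values = [
    ("r1", 0, 0, 0, "", "TAGGC"),               # read CAAATAGGC, seed at offset 0
    ("chr1", 6, 1, 0, "ACGTAG", "TCGGCTT"),     # reference seed at position 6
]
print(reduce_seed("CAAA", values, k=1))
```

The read arrives first in the input list on purpose: the sort on `is_ref` is what guarantees the reference entry has been stored before any read is compared against it.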
CloudBurst Runtimes
CloudBurst vs. RMAP
CloudBurst vs. Crossbow
CloudBurst paper got people talking about MapReduce in bioinformatics, but more of a proof-of-concept than production-ready aligner
Missing some features like quality scores and paired-end mapping
In practice, emitting all possible mappings is extremely disk-intensive
Usually only need the best mapping
A more typical use of MapReduce is Crossbow, which parallelizes Bowtie, an aligner based on BWT/FM-index
The BWT/FM-index is a compressed suffix array representation of the reference sequence.
Human reference can be stored in ~3GB, and therefore can be made available to mappers in the Hadoop distributed cache
Crossbow Results
Crossbow on Amazon EC2
Short read alignment - Suffix arrays
Suffix tree/array based methods are very fast for exact match, as we've seen
Finds multiple match locations at the same time
However, they have large memory requirements
Assume 4 byte integers, then a suffix array needs 4n bytes, roughly 12GB for the human genome
Implementation tricks and compression can get this down, but not by much
To get around this, the current crop of aligners use the FM-Index, based on the Burrows-Wheeler Transform (BWT)
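The memory estimate is simple arithmetic on the 3.2 billion base figure quoted earlier. The sketch below uses decimal gigabytes, and also shows, for comparison, what the raw sequence costs at two bits per base.

```python
GENOME_BASES = 3_200_000_000  # human reference size quoted in these slides
BYTES_PER_ENTRY = 4           # one 4-byte integer per suffix

suffix_array_gb = GENOME_BASES * BYTES_PER_ENTRY / 10**9
two_bit_sequence_gb = GENOME_BASES * 2 / 8 / 10**9

print(f"suffix array: {suffix_array_gb:.1f} GB")
print(f"2-bit packed sequence: {two_bit_sequence_gb:.1f} GB")
```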
Burrows-Wheeler Transform
BWT (Burrows and Wheeler, 1994) is a transformation on a string X related to building a suffix array. The easy way to build it:
Append an end-of-string symbol $, lexicographically smaller than all characters in X, to the end of X
Create a matrix of all cyclic rotations of X$ and sort it
BWT(X) is the last column of the sorted matrix
The BWT was originally designed for compression purposes
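The "easy way" translates almost line for line into Python. This is the naive construction (it materializes every rotation), useful only for small examples; the function name is arbitrary.

```python
def bwt_by_rotations(x, end="$"):
    """Sort all cyclic rotations of x + end and take the last column."""
    s = x + end  # '$' sorts before the letters A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

print(bwt_by_rotations("ACAACG"))
```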
Computing BWT
Append $
X$: ACAACG$
    0123456
Create Cycled Strings
0 ACAACG$
1 CAACG$A
2 AACG$AC
3 ACG$ACA
4 CG$ACAA
5 G$ACAAC
6 $ACAACG
Sort Rows
6 $ACAACG
2 AACG$AC
0 ACAACG$
3 ACG$ACA
1 CAACG$A
4 CG$ACAA
5 G$ACAAC
Take Last Column
Suffix array: [6,2,0,3,1,4,5]
BWT(X): GC$AAAC
Relation of BWT and Suffix Arrays
For our string ACAACG, the suffix array S is [6,2,0,3,1,4,5], and BWT(X) is GC$AAAC
BWT(X) and S are related:
BWT(i) = $ if S[i] == 0
BWT(i) = X(S[i] - 1) otherwise
Therefore, we can use linear time and space algorithms for building suffix arrays to compute BWT(X)
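The relation can be checked directly. Note the assumption in this sketch: the suffix array is built by sorting suffixes with Python's `sorted`, which is simple but not one of the linear-time constructions the slide refers to.

```python
def suffix_array(s):
    """Naive suffix array: start positions of the suffixes of s in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_from_suffix_array(x, end="$"):
    """BWT[i] = end if S[i] == 0, otherwise the character just before suffix S[i]."""
    s = x + end
    sa = suffix_array(s)
    bwt = "".join(end if i == 0 else s[i - 1] for i in sa)
    return bwt, sa

print(bwt_from_suffix_array("ACAACG"))
```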
Aside on BWT Compression
Why is BWT useful for compression?
The transform attempts to take advantage of patterns in the text to produce a string with more repeated characters
Imagine the BWT of an English text, which contains many instances of 'the'
The sorted matrix will have many rows that start with 'he' and end with 't', which means the last column will have a large run of 't's
This makes it easy to compress
There are fast algorithms for reversing the transform and searching in it without decompressing
BWT properties
The matrix of sorted cyclic rotations has the property of 'last-first mapping'
The ith occurrence of char a in the last column is the same character in X as the ith occurrence of a in the first column
We can use this property to reverse the transformation
Langmead et al, Genome Biology, 2009.
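One way to see the last-first mapping at work is to invert the transform with it. In this sketch (names are illustrative) a stable sort of the last column yields the first column while preserving the order of equal characters, which is exactly the property stated above.

```python
def inverse_bwt(bwt):
    """Recover X from BWT(X) by walking the last-first (LF) mapping."""
    n = len(bwt)
    # order[j] = index in the last column of the character in first-column row j.
    # sorted() is stable, so the i-th 'a' in the last column stays the i-th 'a'.
    order = sorted(range(n), key=lambda i: bwt[i])
    lf = [0] * n
    for first_row, last_row in enumerate(order):
        lf[last_row] = first_row
    out = []
    row = 0  # row 0 starts with '$', so its last character is the last of X
    for _ in range(n - 1):
        out.append(bwt[row])
        row = lf[row]  # jump to the row that starts with the character just read
    return "".join(reversed(out))

print(inverse_bwt("GC$AAAC"))
```

The walk produces X back to front, one character per LF step, without ever rebuilding the rotation matrix.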
The FM Index
Ferragina and Manzini (2000) developed methods for searching BWT transformed strings with minimal memory requirements
First we need to define two sets of values:
Let C(a) be the count of symbols in X[0,n-2] that are lexicographically smaller than a, where n is the length of X
Let O(a,i) be the number of occurrences of a in BWT(X)[0,i]
Using these values, we can quickly search for substrings using ranges on the BWT string
Searching with the FM Index
Searching for the pattern "AAC" in our string ACAACG
Let c be the current character, sp be the start of our range on BWT and ep be the end
Start with c equal to the last character in the pattern

Initialize:
c = 'C'
sp = C[c] + 1 = 4
ep = C[c+1] + 1 = C['G'] + 1 = 6
C[$] = 0   O($,1)=0  O($,2)=0  O($,3)=0  O($,4)=0  O($,5)=0  O($,6)=0
C[A] = 0   O(A,1)=0  O(A,2)=0  O(A,3)=0  O(A,4)=1  O(A,5)=2  O(A,6)=3
C[C] = 3   O(C,1)=0  O(C,2)=1  O(C,3)=1  O(C,4)=1  O(C,5)=1  O(C,6)=1
C[G] = 5   O(G,1)=1  O(G,2)=1  O(G,3)=1  O(G,4)=1  O(G,5)=1  O(G,6)=1

Move to next character
c = 'A'
sp = C[c] + O(c,sp) + 1 = 2
ep = C[c] + O(c,ep) + 1 = 4

Move to next character
c = 'A'
sp = C[c] + O(c,sp) + 1 = 1
ep = C[c] + O(c,ep) + 1 = 2
Stop when pattern exhausted or sp >= ep
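The walkthrough can be reproduced with a short backward search. This sketch makes two conventions explicit that the numbers on the slides imply: O(a,i) counts occurrences of a among the first i characters of the BWT, and [sp, ep) is a half-open range of rows in the sorted matrix, with the + 1 accounting for the row that starts with $. The function names and the naive index construction are assumptions for illustration.

```python
def fm_index(text, end="$"):
    """Build the BWT plus the C and O tables (naively, for small inputs)."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = "".join(row[-1] for row in rotations)
    alphabet = sorted(set(text))
    # C[a]: number of symbols in text (without $) that are smaller than a.
    C = {a: sum(1 for ch in text if ch < a) for a in alphabet}
    # O[a][i]: occurrences of a among the first i characters of the BWT.
    O = {a: [0] * (len(bwt) + 1) for a in alphabet}
    for i, ch in enumerate(bwt):
        for a in alphabet:
            O[a][i + 1] = O[a][i] + (ch == a)
    return bwt, C, O

def backward_search(pattern, C, O, n):
    """Return the half-open row range [sp, ep) of rotations prefixed by pattern."""
    sp, ep = 0, n  # start with every row
    for c in reversed(pattern):
        if c not in C:
            return (0, 0)
        sp = C[c] + O[c][sp] + 1
        ep = C[c] + O[c][ep] + 1
        if sp >= ep:
            return (sp, sp)  # empty range: pattern does not occur
    return (sp, ep)

bwt, C, O = fm_index("ACAACG")
sp, ep = backward_search("AAC", C, O, len(bwt))
print(sp, ep, "occurrences:", ep - sp)
```

The intermediate ranges are (4, 6), (2, 4), and (1, 2), matching the three steps above; row 1 of the sorted matrix corresponds to suffix array entry 2, the position of "AAC" in ACAACG.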
Advantages of BWT/FM
Backwards searching is equivalent to traversing a suffix tree. But why go to this trouble?
Our goal was to save memory, but the O(a,i) array would be very large: 4n integers
The trick is to only store a few entries from the O array and compute the rest on the fly from the BWT string
For example BWA (Li and Durbin, 2009) only stores O(·,i) for i's that are multiples of 128
A DNA string, or its BWT, can be stored using only two bits per character
This leads to a massive memory savings
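The checkpoint idea can be sketched as a small class: keep counts only at every `step`-th position and scan the BWT from the nearest checkpoint for everything in between. The class name and layout are invented for this illustration and are not BWA's data structure; O(a,i) here counts occurrences among the first i characters, and the BWT is assumed to be non-empty.

```python
class SampledOcc:
    """Occurrence counts stored only at positions that are multiples of `step`."""

    def __init__(self, bwt, step=128):
        self.bwt = bwt
        self.step = step
        self.checkpoints = []
        counts = dict.fromkeys(sorted(set(bwt)), 0)
        for i, ch in enumerate(bwt):
            if i % step == 0:
                self.checkpoints.append(dict(counts))  # counts for bwt[:i]
            counts[ch] += 1

    def occ(self, a, i):
        """Occurrences of a in bwt[:i]: checkpoint value plus a short scan."""
        block = min(i // self.step, len(self.checkpoints) - 1)
        start = block * self.step
        return self.checkpoints[block].get(a, 0) + self.bwt[start:i].count(a)

occ = SampledOcc("GC$AAAC", step=2)
print(occ.occ("A", 4), occ.occ("A", 6))
```

The trade-off is a scan of at most `step` characters per lookup in exchange for storing roughly 1/`step` of the full table.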

Inexact tree matching
Can use suffix tree to align portion of the read and extend (like BYP)
Some use greedy backtracking approaches that do not guarantee that all matches will be found
Others do all possible backtrackings
The dotted line shows a path through the prefix tree of "GOOGOL" looking for the pattern "LOL" with 1 mismatch
Li and Durbin, Bioinformatics, 2009
Inexact tree matching
BWA (Li and Durbin, 2009) computes a lower bound on the number of possible mismatches for substrings of the pattern
For the previous example, "LOL" and "GOOGOL":
D(0) = 0
D(1) = 1
-> "LO" is not a substring
D(2) = 1
Therefore would not traverse the "G" and "O" branches
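The D values for this example can be computed with a simple greedy scan: extend the current piece of the pattern until it stops being a substring of the reference, then count one unavoidable difference and start a new piece. The sketch below uses Python's `in` for the substring test as a stand-in for the index lookups an aligner would use, and the function name is invented.

```python
def lower_bound_d(pattern, reference):
    """D[i] = lower bound on the differences needed to align pattern[0..i]."""
    d = []
    differences = 0
    piece_start = 0
    for i in range(len(pattern)):
        if pattern[piece_start:i + 1] not in reference:
            differences += 1     # this piece cannot match exactly anywhere
            piece_start = i + 1  # start a fresh piece after position i
        d.append(differences)
    return d

print(lower_bound_d("LOL", "GOOGOL"))
```

During backtracking, a branch can be abandoned as soon as the remaining mismatch budget falls below the D value for the part of the pattern still to be matched.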
Additional Complications
Incorporating base quality scores
Bowtie (Langmead et al 2009) chooses backtracking paths based on quality scores
Others re-rank alignments based on the quality scores of their matches and mismatches
Both the reads and their reverse complements need to be mapped to the reference
Need to either hold two data structures in memory or convert on the fly
Current State of Short Read Aligners
Hash-Index based aligners are the most sensitive
Important for applications where mismatches cause big problems like structural variation detection
Can be very slow and use lots of memory
Suffix tree based aligners are gaining in popularity
Much faster
Dramatically less memory usage (3GB vs. 8-12GB)
One recent strategy is to use a tiered alignment strategy, with a tree based aligner first and an indexed aligner on the reads it fails to map unambiguously
de novo Short Read Assembly
What if there's no genome reference to align to?
Sequencing new species from scratch
Metagenomic studies - take a sample (of water, soil, etc) and sequence all the DNA you find in it
Characterizing areas that are heavily divergent from the reference
Can't cast these problems as mapping/alignment anymore
Instead, we want to piece the short reads back together to find the long sequences from which they came
Look for overlaps between reads

De Bruijn Graphs
Most de novo assemblers use a data structure called a de Bruijn graph to hold overlaps between reads
Each node represents a set of k-mers from the reads that overlap by k - 1 symbols
Zerbino and Birney, Genome Res, 2008
Assembly with de Bruijn Graphs
Use a set of hash table indices to add each read to the graph
Collapse non-branching chains of nodes into single nodes to produce 'contigs'
Try to use paired end reads, which come from a known distance apart, to scaffold contigs together
This process is extremely RAM intensive - often conducted on 512GB supercomputers
Active research into parallelizing and distributing the process
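The first two steps can be illustrated with a toy graph. This sketch uses the common textbook formulation in which nodes are (k-1)-mers and each k-mer is an edge, which is a simplification of the node definition used by production assemblers; it ignores reverse complements, sequencing errors, and isolated cycles, and all names are invented.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build successor/predecessor maps over (k-1)-mers; each k-mer is an edge."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            a, b = kmer[:-1], kmer[1:]
            succ[a].add(b)
            pred[b].add(a)
    return succ, pred

def contigs(succ, pred):
    """Collapse non-branching chains into contig strings."""
    def is_inner(node):  # exactly one way in and one way out
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    out = []
    for node in sorted(set(succ) | set(pred)):
        if is_inner(node):
            continue  # chains start only at branch points or ends
        for nxt in sorted(succ.get(node, ())):
            path, cur = node, nxt
            while True:
                path += cur[-1]  # each step adds one new base
                if not is_inner(cur):
                    break
                cur = next(iter(succ[cur]))
            out.append(path)
    return out

succ, pred = de_bruijn(["ACGTCA", "GTCAGA"], k=4)
print(contigs(succ, pred))
```

Two overlapping reads collapse into a single contig here; a repeat or a variant would show up as a branch and split the output into several contigs, which is where paired-end scaffolding comes in.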
