Sequencing Applications
RNA-Seq:
Sequence RNA
Chromatin Immunoprecipitation (ChIP-Seq):
Transcription Factors: proteins that bind to DNA to turn genes on or off
Bisulfite Sequencing
MapReduce sort and shuffle ensures that all sequences that contain
a seed are processed by the same reducer
Human reference can be stored in < 4 GB, and therefore can be
made available to mappers in the Hadoop distributed cache
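The grouping-by-seed behavior described above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not Hadoop code: the names `map_phase`, `shuffle`, `run`, and the seed length `SEED_LEN` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

SEED_LEN = 4  # hypothetical seed length, chosen only for illustration


def map_phase(name, seq, seed_len=SEED_LEN):
    """Emit one (seed, (name, offset)) pair for every seed-length window."""
    for offset in range(len(seq) - seed_len + 1):
        yield seq[offset:offset + seed_len], (name, offset)


def shuffle(pairs):
    """Group values by key, imitating the sort-and-shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sorting the keys mirrors the ordering a reducer would see.
    return dict(sorted(groups.items()))


def run(sequences):
    """Run the map phase over every sequence, then shuffle the output."""
    pairs = []
    for name, seq in sequences.items():
        pairs.extend(map_phase(name, seq))
    return shuffle(pairs)


sequences = {"read1": "ACGTAC", "read2": "TTACGT"}
grouped = run(sequences)
```

Every key in `grouped` stands for the input of one reduce call, so both sequences containing the seed `ACGT` end up in the same list.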
5/3/2011 CS 506 Problem Solving with Large Clusters
Crossbow Results
Crossbow on Amazon EC2
Short read alignment / Suffix arrays
To get around this, the current crop of aligners use the
FM-Index, based on the Burrows-Wheeler Transform
(BWT)
Burrows-Wheeler Transform
The sorted matrix will have many rows that start with "he"
and end with "t", which means the last column will have a
large run of "t"s
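The effect is easy to reproduce with a naive transform that sorts all rotations. This is a sketch for small inputs only: the `$` sentinel, the function name `bwt`, and the sample text (underscores instead of spaces, so that `$` sorts first) are assumptions made for the illustration.

```python
def bwt(text, sentinel="$"):
    """Return (last column, sorted rotation matrix) for text + sentinel.

    Naive version that materializes every rotation; real tools build the
    transform from a suffix array instead.
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(row[-1] for row in rotations)
    return last_column, rotations


last, rows = bwt("the_hat_the_cat_the_mat")

# Rows that start with "he" are adjacent after sorting, and each one
# ends with the character that precedes "he" in the text.
he_rows = [row for row in rows if row.startswith("he")]
he_endings = [row[-1] for row in he_rows]
```

In this sample every `he` is preceded by `t`, so `he_endings` contains only `t` and those characters sit next to each other in the last column.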
The ith occurrence of character a in the last column is the same
character in X as the ith occurrence of a in the first column
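This last-to-first property is what makes the transform reversible. A minimal sketch, assuming the transform was built with a `$` sentinel that sorts before every other symbol; the function name `inverse_bwt` is hypothetical.

```python
from collections import Counter


def inverse_bwt(last):
    """Rebuild the original text from the last column via LF mapping."""
    # rank[i]: how many times last[i] already occurred in last[:i]
    seen = Counter()
    rank = []
    for ch in last:
        rank.append(seen[ch])
        seen[ch] += 1

    # first_pos[ch]: row where the block of ch starts in the first column
    first_pos = {}
    total = 0
    for ch in sorted(seen):
        first_pos[ch] = total
        total += seen[ch]

    # Row 0 starts with the sentinel, so its last character is the final
    # character of the text. Each LF step moves one character to the left.
    row = 0
    out = []
    for _ in range(len(last) - 1):
        ch = last[row]
        out.append(ch)
        row = first_pos[ch] + rank[row]
    return "".join(reversed(out))


original = inverse_bwt("annb$aa")
```

The input `annb$aa` is the transform of `banana$`, so `original` comes back as `banana`.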
Let C(a) be the count of symbols in X[0, n-2] that are
lexicographically smaller than a, where n is the length of
X
Let c be the current character, sp be the start of our range on BWT
and ep be the end
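With these definitions, exact matching walks the pattern from right to left and narrows the range one character at a time. The sketch below makes two simplifying assumptions that differ from the slide's notation: ranges are 0-based and half-open, and `C` is counted over the text including the `$` sentinel. The helper `occ` is a naive prefix count, where a real FM-Index would use sampled counts.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations."""
    s = text + sentinel
    return "".join(row[-1] for row in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_occurrences(text, pattern):
    """Count exact matches of pattern in text with backward search."""
    last = bwt(text)

    # C[a]: number of symbols (sentinel included) that sort before a
    C = {}
    total = 0
    for a in sorted(set(last)):
        C[a] = total
        total += last.count(a)

    def occ(a, i):
        """Occurrences of a in last[:i] (naive, linear time)."""
        return last[:i].count(a)

    sp, ep = 0, len(last)          # half-open range [sp, ep) of rows
    for c in reversed(pattern):    # c is the current character
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp)
        ep = C[c] + occ(c, ep)
        if sp >= ep:               # empty range: no match
            return 0
    return ep - sp


hits = count_occurrences("banana", "ana")
```

Each loop iteration is one update of `sp` and `ep`; the final range width is the number of suffixes that start with the pattern.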
A DNA string, or its BWT, can be stored using only two
bits per character
D(0) = 0
D(1) = 1
← "LO" is not a substring
D(2) = 1
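These values are consistent with a lower-bound array of the kind used in BWT-based inexact matching, where D(i) counts how many times the query prefix had to be cut because the current piece is not a substring of the reference. The sketch below rests on that assumption. The strings `GOOGOL` (reference) and `LOL` (query) are also assumed, chosen because they reproduce the listed values, and the substring test uses Python's `in` operator rather than an index lookup.

```python
def calculate_d(query, reference):
    """Lower bound on the number of differences in query[0..i], for each i.

    Whenever the current piece query[j..i] is not a substring of the
    reference, at least one more difference is needed, and a new piece
    starts at i + 1.
    """
    d = []
    differences = 0
    j = 0  # start of the current piece
    for i in range(len(query)):
        if query[j:i + 1] not in reference:
            differences += 1
            j = i + 1
        d.append(differences)
    return d


d = calculate_d("LOL", "GOOGOL")
```

Here `d[1]` is where the bound rises, because `LO` does not occur in the assumed reference.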
Much &aster
Most de novo
assemblers use a data
structure called a de
Bruijn graph to hold
overlaps between reads
Use a set of hash table indices to add each read to the
graph
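A minimal sketch of the idea, using a Python dict as the hash table: each read is cut into k-mers, and every k-mer adds an edge from its (k-1)-length prefix to its (k-1)-length suffix. The function name `build_de_bruijn`, the value of `k`, and the sample reads are illustrative assumptions, not taken from any particular assembler.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph as {prefix node: set of successor nodes}."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the (k-1)-mer prefix to the (k-1)-mer suffix
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


reads = ["ACGTC", "GTCA"]
graph = build_de_bruijn(reads, k=3)
```

The two reads overlap on `GTC`, so they share the node `GT` and its edge to `TC`, and a path through the graph spells out the merged sequence.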