Documentos de Académico
Documentos de Profesional
Documentos de Cultura
BLOSUM Background
Prosite data base: dictionary of sites and patterns in proteins; linked to Swiss-Prot
database
Goal is to identify biologically significant patterns in protein families (with
special emphasis on those regions thought to be important to protein function)
Tries to find good discriminators that emphasize reliable identification of known
family members while excluding known non-members
Prosite patterns: signature motifs
Example: Helicase proteins
involved in unwinding and opening of DNA strands in preparation for transcription
Werners syndrome: mutation in helicase causes affected individuals to age at a
an accelerated rate
Hundreds of helicases from different organisms have been sequenced; much of
what we know about how they work comes from computer-assisted analysis of
these sequences
(regular expression)
a
a
a
c
b
b
c
b
c
a
c
-
b
b
b
a
a
c
Corresponding profile:
a
b
c
-
1
0.75
3
0.25
5
0.50
0.75
0.75
0.25 0.25 0.50
0.25
0.25 0.25 0.25
Why blocks?
Need to have a multiple alignment; easier to align with similar sequences
Dont want insertions and deletions to complicate estimation of substitution
probabilities
Interested in detecting conserved regions of protein sequences, so restrict attention to
these regions when computing the scoring matrix
c (kij )
Just as with the PAM matrix, we will compute the BLOSUM score as the (log) ratio of the
AA
4
4
AB
4
1
AC
4
1
BB
1
1
BC
1
1
CC
1
1
4(4-1)/2 = 6
(4)(1) = 4
(4)(1) = 4
(1)(1-1)/2 = 0
(1)(1) = 1
(1)(1-1)/2 = 0
(k )
ii
ni
=
2
)
c (k
ij = n i n j
T = c = w n(n 1)
ij
2
i j
qij =
c ij
T
qAB =
4 + 8 + 0 + 0 + 0 + 0 + 0 12
=
(6)(5)
105
7
2
AABCD---BBCDA
DABCD-A-BBCBB
BBBCDBA-BCCAA
AAACDC-DCBCDB
CCBADB-DBBDCC
AAACA---BBCCC
qij
pi = qii +
ji 2
5. The desired denominator is the expected frequency for each pair (assuming
independence):
eii = pi2
eij = 2 pi p j
(i j)
6. Each entry for (i,j) in the log odds matrix is then equal to qij/eij
7. Log odds
ratio:
qij
sij = log 2
eij
8. Value stored for BLOSUM = 2 sij, rounded to nearest integer (half bit units)
A
S
T
T
A
A
A
A
A
A
I
L
L
V
L
A
I
L
S
T
V
A
1 +10
I
0
3
2
1
3
0
(5)(4)
T = c ij = 3
= 30
2
i j
A
11
30
6
pA = 11+ 30 = 14 30 = 0.466
4
pI = 0 + 30 = 2 30 = 0.066
6
pL = 3 + 30 = 6 30 = 0.2
4
pS = 0 + 30 = 2 30 = 0.066
2
6
pT = 1+ 30 = 4 30 = 0.133
2
4
pV = 0 + 30 = 2 30 = 0.066
0
3
Vector of pi values:
30
30
30
30
0
0
2
A
I
L
S
T
V
A
I
0.366
0
0
0
0.1
0.066
0
0.133
0
0
0.033
30
30
30
0
30
0.1
0
0
0.1
0.066 0.033
0
0
(14 30)
2(14 30)( 2 30)
2(14 )( 6 )
30 30
2(14 30)( 2 30)
2(14 )( 4 )
30 30
2(14 30)( 2 30)
(2 30)
2( 2 )( 6 )
30 30
2( 2 30)( 2 30)
2( 2 )( 4 )
30 30
2( 2 30)( 2 30)
(6 30)
2( 6 30)( 2 30)
2( 6 )( 4 )
30 30
2( 6 30)( 2 30)
(2 30)
2( 2 )( 4 )
30 30
2( 2 30)( 2 30)
(4 30)
2( 4 30)( 2 30)
(2 30)
0.366
e.g., sAA = log 2
2 = log 2 1.6837 = 0.7516
14
30
Full matrix:
A
2
I ?
L ?
S 0
T 0
V ?
T V
?
4
?
?
4
3
?
?
4
?
4
?
2
?
Note: undefined values result from unobserved pairs (would ordinarily not happen with real data)
Relative entropy
H = qij sij
i j
PAM100
PAM120
PAM160
PAM200
PAM250
==>
==>
==>
==>
==>
Blosum90
Blosum80
Blosum60
Blosum52
Blosum45