
SEQUENCE ALIGNMENT

Two-sequence alignment. Multiple alignment.

Fundamental steps of the procedure leading to the optimal alignment of two sequences

Sequence 1 (length n): RVCPKILMECKKDSDCLAECICLEHGYCG
Sequence 2 (length m): MVCPKILMKCKHDSDCLLDCVCLEDIGYCGVS

Step         Identical positions   Identity
1             0                     0.0%
2             0                     0.0%
3             0                     0.0%
4             1                    25.0%
5             0                     0.0%
...
n - 1         1                     3.6%
n            18                    62.1%
n + 1         5                    17.2%
n + 2         2                     6.9%
...
n + m - 3     1                    33.3%
n + m - 2     0                     0.0%
n + m - 1     0                     0.0%

Step n with one gap inserted:
RVCPKILMECKKDSDCLAECICLEH-GYCG
MVCPKILMKCKHDSDCLLDCVCLEDIGYCGVS
22 identical positions, 73%
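The sliding comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `identities_at_step` and the step numbering (step s places the last s residues of sequence 1 against the first s residues of sequence 2, up to step n + m - 1) are inferred from the percentages on the slide, not taken from the author's program. Gaps are not handled here.

```python
SEQ1 = "RVCPKILMECKKDSDCLAECICLEHGYCG"      # length n = 29
SEQ2 = "MVCPKILMKCKHDSDCLLDCVCLEDIGYCGVS"   # length m = 32


def identities_at_step(seq1, seq2, step):
    """Return (identical positions, overlap length) for one sliding step.

    Steps run from 1 to len(seq1) + len(seq2) - 1 (assumed numbering).
    At a given step, residue seq1[i] faces residue seq2[i + shift].
    """
    shift = step - len(seq1)
    identical = 0
    overlap = 0
    for i, residue in enumerate(seq1):
        j = i + shift
        if 0 <= j < len(seq2):
            overlap += 1
            if residue == seq2[j]:
                identical += 1
    return identical, overlap


n, m = len(SEQ1), len(SEQ2)
for step in (4, n - 1, n, n + 1, n + 2, n + m - 3):
    identical, overlap = identities_at_step(SEQ1, SEQ2, step)
    print(step, identical, f"{100 * identical / overlap:.1f}%")
```

The step with the highest identity is then the starting point for gap insertion, which raises the number of identical positions from 18 to 22 in the example.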

Comparison of fragments of the 1st and 2nd domains of chicken ovomucoid using the unitary matrix, the genetic code matrix (GCM), PAM250 and the algorithm of genetic semihomology

Fragment 1)
GTTAATTGCAGCCTGTATGCCAGCGGCATCGGCAAGGATGGG ACGAGTTGGGTAGCC
V N C S L Y A S G I G K D G T S W V A

Fragment 2)
ATTGATTGCTCTCCGTACCTCCAA GTTGTAAGAGATGGTAACACCATGGTAGCC
I D C S P Y L Q - V V R D G N T M V A

UNITARY MATRIX
V N C S L Y A S G I G K D G T S W V A
I D C S P Y <L Q V V R> D G N T M V A
0 0 1 1 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1
Score: 7/19 (36.8%)

GENETIC CODE MATRIX
GTTAATTGCAGCCTGTATGCCAGCGGCATCGGCAAGGATGGG ACGAGTTGGGTAGCC
ATTGATTGCTCTCCGTACCTC < CAA > GTTGTAAGAGATGGTAACACCATGGTAGCC
2 2 3 0 2 2 1 0 0 1 1 1 3 2 1 1 1 3 3
Score: 29/57 (50.9%)

PAM250 SCORING
V N C S L Y A S G I G K D G T S W V A
I D C S P Y L < Q > V V R D G N T M V A
1 1 2 2 0 2 0 0 0 1 0 1 2 2 1 1 0 2 2
Score: 42/97 (43.3%), 42/89 (47.2%), 20/38 (52.6%)

GENETIC SEMIHOMOLOGY
V N C S L Y A S G I G K D G T S W V A
I D C S P Y L < Q > V V R D G N T M V A
2 2 3 3 2 3 0 0 0 2 1 2 3 3 1 1 0 3 3
Score: 34/57 (59.6%)
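Two of the scoring schemes above are simple enough to reproduce directly. The sketch below is a hedged illustration, not the author's implementation: the function names are hypothetical, the gap is written as `-` (protein) or `---` (DNA), and the genetic code matrix is assumed to score each codon pair by the number of nucleotide positions they share (0 to 3), which is consistent with the per-position values on the slide.

```python
PROTEIN_1 = "VNCSLYASGIGKDGTSWVA"
PROTEIN_2 = "IDCSPYLQ-VVRDGNTMVA"
DNA_1 = "GTTAATTGCAGCCTGTATGCCAGCGGCATCGGCAAGGATGGGACGAGTTGGGTAGCC"
DNA_2 = "ATTGATTGCTCTCCGTACCTCCAA---GTTGTAAGAGATGGTAACACCATGGTAGCC"


def unitary_scores(seq_a, seq_b):
    """Unitary matrix: 1 for identical residues, 0 otherwise (gaps score 0)."""
    return [int(a == b and a != "-") for a, b in zip(seq_a, seq_b)]


def codon_scores(dna_a, dna_b):
    """Assumed genetic code matrix: shared nucleotide positions per codon pair."""
    scores = []
    for i in range(0, len(dna_a), 3):
        codon_a, codon_b = dna_a[i:i + 3], dna_b[i:i + 3]
        scores.append(sum(a == b and a != "-" for a, b in zip(codon_a, codon_b)))
    return scores


identity = unitary_scores(PROTEIN_1, PROTEIN_2)
genetic = codon_scores(DNA_1, DNA_2)
print(f"unitary: {sum(identity)}/{len(identity)}")
print(f"genetic code: {sum(genetic)}/{3 * len(genetic)}")
```

Position 4 shows why the two schemes differ: both fragments carry serine (unitary score 1), but the codons AGC and TCT share no nucleotide position (genetic code score 0).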

1) Contribution (%) of identical positions


PKILMECKKD
PKILMKCKHD    8 identical, 80%: similar

PKILMECKKD
SDCLLDCVCL    2 identical, 20%: not similar

2) Length of the compared strings (sequences)


LCE
WCG    1 identical, 33.3%: coincidental

MVEICIEPKIRCIKVCTKDERITCLILDET
MVYWCPRRFMHCVHLKAGGCTCWCLRLDYY    8 identical, 26%: probably similar

What is important in a protein similarity search?

3) Distribution of the identical positions along the analyzed sequence


MVEMICIEPKIRCIKVCTKDERITL
HVYYWRPERFMHTVKLKAGGCRCWL    5 identical, 20%: coincidental

MVEMIMAGDARCIKVCTKDERITCL
HHYYWMAGDAHTVQLKAGGCWCWAG    5 identical, 20%: similar
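The difference between the two pairs above is not the number of identities but how they are distributed. A minimal sketch of one way to quantify this, assuming the longest uninterrupted run of identical positions as the measure (the slide does not name a specific metric, and `identity_profile` is a hypothetical helper):

```python
def identity_profile(seq_a, seq_b):
    """Return (number of identical positions, longest run of consecutive identities)."""
    matches = [a == b for a, b in zip(seq_a, seq_b)]
    longest = current = 0
    for is_match in matches:
        current = current + 1 if is_match else 0
        longest = max(longest, current)
    return sum(matches), longest


scattered = identity_profile("MVEMICIEPKIRCIKVCTKDERITL",
                             "HVYYWRPERFMHTVKLKAGGCRCWL")
clustered = identity_profile("MVEMIMAGDARCIKVCTKDERITCL",
                             "HHYYWMAGDAHTVQLKAGGCWCWAG")
print(scattered, clustered)
```

Both pairs have five identical positions out of 25, but in the second pair they form one contiguous block (MAGDA), which is much less likely to be coincidental.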

4) Residues at conservative positions


MVCPKILMKCKHDSDCLLDCVCLED
EDEGKRRTKREHFKESNLAAAFKEQ    not similar

MVCPKILMKCKHDSDTLLDCVCLED
QNCPGPREWCFTTRMNDSSCACPQT    similar

5) Structural/genetic similarity of the amino acids at non-conservative positions


Identity only
MVCPKILMKCKHDSDCLLDCVCLED
RLCRRLVKRCRKETECIVECICIDE

Structural
MVCPKILMKCKHDSDCLLDCVCLED
RLCRRLVKRCRKETECIVECICIDE

Genetic
MVCPKILMKCKHDSDCLLDCVCLED
RLCRRLVKRCRKETECIVECICIDE

The sequence identity estimation procedure

The probability that a match with at least a identical positions occurs at random (the number of identities is equal to or higher than the declared value a) is:

$$P_{a}^{n} = \frac{\sum_{k=a}^{n} \binom{n}{k}\, x^{k}\, \bigl(x(x-1)\bigr)^{n-k}}{x^{2n}}$$

Where:
x is the number of unit types in the sequence (20 for proteins; 4 for nucleic acids)
n is the sequence length (the number of compared position pairs)
a is the number of identical positions
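The formula can be evaluated exactly with integer arithmetic. The sketch below is an illustration only; the function name `p_min_identity` is hypothetical. It counts, among all x^(2n) possible pairs of sequences, those with k identical positions (x choices for each identical pair, x(x-1) for each non-identical pair) and sums over k from a to n.

```python
from math import comb


def p_min_identity(a, n, x=20):
    """Probability that at least `a` of `n` compared positions are identical by chance.

    x is the number of unit types (20 for proteins, 4 for nucleic acids).
    """
    favourable = sum(
        comb(n, k) * x**k * (x * (x - 1)) ** (n - k)
        for k in range(a, n + 1)
    )
    return favourable / x ** (2 * n)


# Chance of 8 or more identities in a 10-residue protein comparison
print(p_min_identity(8, 10))
# Chance of 5 or more identities in a 25-residue protein comparison
print(p_min_identity(5, 25))
```

Note that the formula treats all unit types as equally frequent, so it is a baseline estimate rather than a composition-corrected one.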

Genetic conditioning of the amino acid replacement probabilities and spectrum in molecular evolution

Do amino acids have a pedigree?


or...

Do they contain the information about their history (genealogy)?


or...

Can amino acid mutational replacements be described as Markovian processes?

The Markov model assumes that the substitution probability of amino acid AA1 by AA2 is the same, regardless of what the initial residue AA1 was transformed from (AAx, AAy)

AAx -> AA1 -(Pa)-> AA2
AAy -> AA1 -(Pb)-> AA2

Pa = Pb

The statistical algorithms currently in use are based on the Markovian model of amino acid replacement (they directly use stochastic matrices of replacement frequency indices).

PAM250 matrix of amino acid replacements

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C  12   0  -2  -3  -2  -3  -4  -5  -5  -5  -3  -4  -5  -5  -2  -6  -2  -4   0  -8
S       2   1   1   1   1   1   0   0  -1  -1   0   0  -2  -1  -3  -1  -3  -3  -2
T           3   0   1   0   0   0   0  -1  -1  -1   0  -1   0  -2   0  -3  -3  -5
P               6   1  -1  -1  -1  -1   0   0   0  -1  -2  -2  -3  -1  -5  -5  -6
A                   2   1   0   0   0   0  -1  -2  -1  -1  -1  -2   0  -5  -3  -6
G                       5   0   1   0  -1  -2  -3  -2  -3  -3  -4  -1  -5  -5  -7
N                           2   2   1   1   2   0   1  -2  -2  -3  -2  -4  -2  -4
D                               4   3   2   1  -1   0  -3  -2  -4  -2  -6  -4  -7
E                                   4   2   1  -1   0  -2  -2  -3  -2  -5  -4  -7
Q                                       4   3   1   1  -1  -2  -2  -2  -5  -4  -5
H                                           6   2   0  -2  -2  -2  -2  -2   0  -3
R                                               6   3   0  -2  -3  -2  -4  -4   2
K                                                   5   0  -2  -3  -2  -5  -4  -3
M                                                       6   2   4   2   0  -2  -4
I                                                           5   2   4   1  -1  -5
L                                                               6   2   2  -1  -2
V                                                                   4  -1  -2  -6
F                                                                       9   7   0
Y                                                                          10   0
W                                                                              17

Why is tryptophan the most conservative residue here?

BLOSUM62 matrix of amino acid replacements

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R       5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N           6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D               6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C                   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q                       5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E                           5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G                               6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H                                   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I                                       4   2  -3   1   0  -3  -2  -1  -3  -1   3
L                                           4  -2   2   0  -3  -2  -1  -2  -1   1
K                                               5  -1  -3  -1   0  -1  -3  -2  -2
M                                                   5   0  -2  -1  -1  -1  -1   1
F                                                       6  -4  -2  -2   1   3  -1
P                                                           7  -1  -1  -4  -3  -2
S                                                               4   1  -3  -2  -2
T                                                                   5  -2  -2   0
W                                                                      11   2  -3
Y                                                                           7  -1
V                                                                               4

Replacement Arg/Lys according to the statistical interpretation using stochastic matrix indices

Matrix       Arg/Lys score
PAM250       3
BLOSUM62     2
BLOSUM35     2
BLOSUM45     3
BLOSUM100    3

Diagram of genetic relationships between amino acids


[Figure: diagram of genetic relationships between amino acids, with the labels AGCU, 1, 2 and 3]

Diagram of codon and amino acid genetic relationships

[Figure: the same diagram with codons, with the labels AGCU, 1, 2 and 3. Codon assignments shown:]

AAA K   AAG K   AAC N   AAU N
AGA R   AGG R   AGC S   AGU S
ACA T   ACG T   ACC T   ACU T
AUA I   AUG M   AUC I   AUU I
GUA V   GUG V   GUC V   GUU V
GCA A   GCG A   GCC A   GCU A
CUA L   CUG L   CUC L   CUU L
GGA G   GGG G   GGC G   GGU G
CCA P   CCG P   CCC P   CCU P
UUA L   UUG L   UUC F   UUU F
GAA E   GAG E   GAC D   GAU D
CGA R   CGG R   CGC R   CGU R
UCA S   UCG S   UCC S   UCU S
CAA Q   CAG Q   CAC H   CAU H
UGA stop   UGG W   UGC C   UGU C
UAA stop   UAG stop   UAC Y   UAU Y

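Arginine-to-lysine mutational conversion pathways for arginines of different origin

Arginine derived from methionine:
Met (AUG) -> Arg (AGG) -> Lys (AAG)

Arginine derived from proline:
Pro (CCC) -> Arg (CGC) -> His (CAC) -> Asn (AAC) -> Lys (AAG)
Pro (CCC) -> Arg (CGC) -> Ser (AGC) -> Arg (AGG) -> Lys (AAG)
Pro (CCC) -> Arg (CGC) -> Arg (CGG) -> Gln (CAG) -> Lys (AAG)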

Possible single-point-mutational processing of serine with respect to its origin

Trp (UGG) -> Ser (UCG) -> Thr, Ser, Ala, Trp, (UAG), Pro, Leu
Asn (AAU) -> Ser (AGU) -> Thr, Ser, Ile, Arg, Gly, Asn, Cys

Amino acid mutational substitution based on a single transition/transversion is NOT a Markovian process

Theoretical proof: the conversion pathways of arginine into lysine, glutamine and serine, for arginine resulting from the processing of codons encoding different amino acids.

Possible codons for arginine: AGA AGG CGA CGG CGC CGT
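The argument can be checked mechanically: which arginine codon is present depends on the codon it arose from, and the set of amino acids reachable by one further substitution depends on that codon. The sketch below is an illustration under stated assumptions: it uses the standard genetic code in DNA notation (T instead of U), and the helper name `one_step_products` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def one_step_products(codon):
    """Amino acids (or '*') reachable from `codon` by one nucleotide substitution."""
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                reachable.add(CODON_TABLE[mutant])
    return reachable


ARG_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "R")
for codon in ARG_CODONS:
    print(codon, "Lys reachable in one step:", "K" in one_step_products(codon))
```

Only AGA and AGG can become lysine in one step. An arginine that arose from methionine (ATG -> AGG) therefore converts to lysine directly, whereas an arginine that arose from proline (CCC -> CGC) cannot, so the Arg-to-Lys probability is not independent of the residue's history.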

Conversion of arginine into lysine
(R = A or G; Y = C or T)

Met (ATG) -> Arg (AGG) -> Lys (AAG)
Leu (CTR) -> Arg (CGR) -> Gln (CAR) or Arg (AGR) -> Lys (AAR)
His (CAY) -> Arg (CGY) -> Ser (AGY) or Arg (CGR) -> Arg (AGR) -> Lys (AAR)

Conversion of arginine into serine

Met (ATG) -> Arg (AGG) -> Ser (AGY)
Leu (CTR) -> Arg (CGR) -> Arg (AGR) or Arg (CGY) -> Ser (AGY)
His (CAY) -> Arg (CGY) -> Ser (AGY)

Conversion of arginine into glutamine

Met (ATG) -> Arg (AGG) -> Lys (AAG) or Arg (CGG) -> Gln (CAG)
Leu (CTR) -> Arg (CGR) -> Gln (CAR)
His (CAY) -> Arg (CGY) -> His (CAY) or Arg (CGR) -> Gln (CAR)

then...

The probability of replacing one amino acid with another depends significantly on which amino acids occupied that position in the past.

There is a high risk that commonly used algorithms applying stochastic data matrices (MDM, PAM, BLOSUM) lead to wrong interpretations of the mutational processes occurring in proteins.

Genetic relationships between Arg and Met/Gln

[Figure: diagram of amino acid genetic relationships]

Arg-Met and Arg-Gln substitutions. Two kinds of arginine


Inhibitors from cucurbit (squash family) plants

1. RVMIG  2. RVMIGS  3. C  4. P  5. RKL  6. I  7. [LW][Y]  8. MNK  9. REKQP  10. C
11. KSQT  12. [KSRHQTY][V]  13. DN  14. RSDA  15. D  16. C  17. LFMP  18. ALTPGR  19. DEGQK  20. C
21. VITKR  22. C  23. LKGQMV  24. PKEQRSA  25. NHEQDS  26. [I][D]  27. GE  28. YFIH  29. C  30. G
* *

Bowman-Birk type inhibitors

47. C  48. C  49. DRBSN  50. QHELZRSIFTK #  51. C  52. ASTKEMILRDVPF *  53. C  54. T
55. [KR][A]  56. S  57. NMIEKRDQ *#  58. P  59. P  60. QKZETI  61. C  62. [RHQS][V] #
63. C  64. STNVAEHR  65. DZBN  66. MILVTR *  67. R  68. L  69. NDE  70. SKTR
71. C  72. H  73. S  74. A  75. C  76. KSDEN  77. SLGRTFH  78. C
79. IAVLM  80. C  81. ATNR  82. LYFRK  83. S  84. YIEFMQDN  85. P  86. AGP
87. QKZM  88. C  89. FVRIHSQ  90. C  91. VTBGLAYF  92. DB  93. [IMTV][Q]  94. TNBKAHD
95. DBNKT  96. FSY  97. C  98. [YH][T]  99. EAKPD  100. PSAK  101. C

Ovomucoid domains (Kazal type)

1. VILE  2. NDH  3. C  4. [STR][D]  5. LPKQE  6. YF  7. ALPKQ  8. SQTK
9. GTRS  10. IVKNT #  11. GVSTL  12. KRTQ  13. DGN  14. G  15. TNRKE  16. STLAQP
17. WMLIV  18. VTI  19. [A][R]  20. C  21. PT  22. [RM][F]  23. [NI][E]  24. [L][Y]
25. KSLQDV  26. [P][E]  27. [V][H]  28. C  29. GA  30. TS  31. DN  32. GS
33. SFV  34. T  35. Y  36. SDA  37. NS  38. [ED][R]  39. C  40. GSTF
41. ILF  42. C  43. [L][A][N]  44. [YH][A] #  45. NY  46. RAILV  47. EQ  48. HQLS
49. GHRN  50. ATR  51. [NHST][E]  52. VIL  53. ESKAGN  54. [K][L] *  55. ELKSRV  56. [YHS][K]
57. [DN][M]  58. GA  59. EKRA  60. C  61. RKE  62. PLQE  63. KERD  64. [ISV][H]
65. [VG][PT]  66. [MEK][PS]

PAM250 and BLOSUM62 scores for the replacements Arg-Lys, Lys-Gln, Lys-Glu, Arg-Gln and Arg-Glu

Replacement   PAM250   BLOSUM62
Arg/Lys          3        2
Lys/Gln          1        1
Arg/Gln          1        1
Lys/Glu          0        1
Arg/Glu         -1        0
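For comparison with the matrix scores above, the genetic distance between two amino acids can be taken as the smallest number of nucleotide substitutions separating any of their codons. This is a hedged sketch of that idea using the standard genetic code; the name `min_codon_distance` is hypothetical and the measure is an assumption made for illustration, not the scoring scheme of the semihomology algorithm itself.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# (PAM250, BLOSUM62) scores copied from the table above
MATRIX_SCORES = {
    ("R", "K"): (3, 2),
    ("K", "Q"): (1, 1),
    ("R", "Q"): (1, 1),
    ("K", "E"): (0, 1),
    ("R", "E"): (-1, 0),
}


def min_codon_distance(aa1, aa2):
    """Smallest number of differing nucleotides between any codons of aa1 and aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1, a in CODON_TABLE.items() if a == aa1
        for c2, b in CODON_TABLE.items() if b == aa2
    )


for (aa1, aa2), (pam, blosum) in MATRIX_SCORES.items():
    print(aa1, aa2, "PAM250:", pam, "BLOSUM62:", blosum,
          "min substitutions:", min_codon_distance(aa1, aa2))
```

Of the five pairs, only Arg/Glu needs two substitutions; the other four are separated by a single one, although only for particular codons of each amino acid.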

Genetic relationships among Arg, Lys, Glu and Gln


[Figure: diagram of amino acid genetic relationships]

Arg-Glu and Lys-Glu substitutions (Arg/Lys/Gln/Glu replacements)


Inhibitors from cucurbit (squash family) plants

1. RVMIG  2. RVMIGS  3. C  4. P  5. RKL  6. I  7. [LW][Y]  8. MNK  9. REKQP  10. C
11. KSQT  12. [KSRHQTY][V]  13. DN  14. RSDA  15. D  16. C  17. LFMP  18. ALTPGR  19. DEGQK  20. C
21. VITKR  22. C  23. LKGQMV  24. PKEQRSA  25. NHEQDS  26. [I][D]  27. GE  28. YFIH  29. C  30. G

Bowman-Birk type inhibitors

47. C  48. C  49. DRBSN  50. QHELZRSIFTK  51. C  52. ASTKEMILRDVPF  53. C  54. T
55. [KR][A]  56. S  57. NMIEKRDQ  58. P  59. P  60. QKZETI  61. C  62. [RHQS][V]
63. C  64. STNVAEHR !  65. DZBN  66. MILVTR  67. R  68. L  69. NDE  70. SKTR
71. C  72. H  73. S  74. A  75. C  76. KSDEN  77. SLGRTFH  78. C
79. IAVLM  80. C  81. ATNR  82. LYFRK  83. S  84. YIEFMQDN  85. P  86. AGP
87. QKZM  88. C  89. FVRIHSQ  90. C  91. VTBGLAYF  92. DB  93. [IMTV][Q]  94. TNBKAHD
95. DBNKT  96. FSY  97. C  98. [YH][T]  99. EAKPD  100. PSAK  101. C

Ovomucoid domains (Kazal type)

1. VILE  2. NDH  3. C  4. [STR][D]  5. LPKQE  6. YF  7. ALPKQ  8. SQTK
9. GTRS  10. IVKNT  11. GVSTL  12. KRTQ  13. DGN  14. G  15. TNRKE  16. STLAQP
17. WMLIV  18. VTI  19. [A][R]  20. C  21. PT  22. [RM][F]  23. [NI][E]  24. [L][Y]
25. KSLQDV  26. [P][E]  27. [V][H]  28. C  29. GA  30. TS  31. DN  32. GS
33. SFV  34. T  35. Y  36. SDA  37. NS  38. [ED][R]  39. C  40. GSTF
41. ILF  42. C  43. [L][A][N]  44. [YH][A]  45. NY  46. RAILV  47. EQ  48. HQLS
49. GHRN  50. ATR  51. [NHST][E]  52. VIL  53. ESKAGN  54. [K][L]  55. ELKSRV  56. [YHS][K]
57. [DN][M]  58. GA  59. EKRA  60. C  61. RKE  62. PLQE  63. KERD  64. [ISV][H]
65. [VG][PT]  66. [MEK][PS]

What part of the codon contains the information about the previous amino acid that occupied a given position of the protein sequence?

At most 2/3 of the entire codon.

Ala (GCG) -> Val (GUG)

How long is the information about the codons of preceding amino acids stored?

The shortest storage period is 3 transitions/transversions:

Ala (GCG) -> Val (GUG) -> Met (AUG) -> Ile (AUA)
Ser (UCC) -> Ser (UCU) -> Thr (ACU) -> Ser (AGU)

Theoretically, the longest period is infinite:

Lys (AAA) -> Asn (AAC) -> Asp (GAC) -> His (CAC) -> Gln (CAG) -> Glu (GAG) -> Asp (GAU) -> Tyr (UAU) -> His (CAU) -> Asn (AAU) -> Lys (AAG) -> Gln (CAG) -> His (CAC) -> ...
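The two storage periods can be illustrated by counting how many nucleotide positions of the ancestral codon survive along each pathway. This is a minimal sketch; `shared_positions` and `memory_trace` are hypothetical helper names, and the codon paths are the ones listed above.

```python
def shared_positions(codon_a, codon_b):
    """Number of positions at which two codons carry the same nucleotide."""
    return sum(x == y for x, y in zip(codon_a, codon_b))


def memory_trace(path):
    """For each codon on the path, positions still shared with the first codon."""
    return [shared_positions(path[0], codon) for codon in path]


SHORT_PATH_1 = ["GCG", "GUG", "AUG", "AUA"]   # Ala -> Val -> Met -> Ile
SHORT_PATH_2 = ["UCC", "UCU", "ACU", "AGU"]   # Ser -> Ser -> Thr -> Ser
LONG_PATH = ["AAA", "AAC", "GAC", "CAC", "CAG", "GAG", "GAU",
             "UAU", "CAU", "AAU", "AAG", "CAG", "CAC"]

print(memory_trace(SHORT_PATH_1))
print(memory_trace(SHORT_PATH_2))
print(memory_trace(LONG_PATH))
```

In the short pathways each substitution hits a different codon position, so nothing of the ancestral codon remains after three steps. In the long pathway the middle nucleotide A is never replaced, so part of the ancestral codon persists for as long as the substitutions keep to positions 1 and 3.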

CONCLUSIONS

The analysis of genetic semihomology excludes the applicability of the Markov model to studies of protein variability at the amino acid level.

Amino acid codons do contain information about the ancestral amino acids whose codons were the starting point for the codon of the current residue.

This applies mainly to positions undergoing single-point mutations, the most basic mechanism of evolutionary variability.
