May-June 2017
Reginald Morgan (ESG-4.3), morganreginald@uniovi.es, Tel. 98 510 4214

Session blocks: Bq1 12-14, Bq2 16-18, Bq3 09-11 (10-12 on some days).
[Calendar grid, Monday to Sunday, 1 May to 4 June: block sessions Bq1, Bq2 and Bq3, with presentation slots (PRESENT.) marked for G1, G2 and G3 in the last weeks.]

GROUP 1 (13): Álvarez Freile, Jimena; Barrientos Álvarez, Octavio; Delgado Rodríguez, Jaime; Gallego Martínez, Borja; García Cantera, Marina; Menéndez Fernández, Iris; Pedrosa Laza, María; Peñarroya Rodríguez, Alfonso; Pérez Amieva, Patricia; Pérez Fuentes, Lucía; Sánchez Fernández, Rosalía; Varela Fernández, Saray; Villaverde Marín, Marina
GROUP 2 (11): Alonso Cordero, Andrés; Álvarez Alija, Nuria; Álvarez González, Ana; Cano Menéndez, Mónica; Escudero Cernuda, Sara; Fernández Borbolla, Andrés; Fernández Conty, Cristina Ana; Hériard-Dubreuilh, Marine; Rodríguez Jardón, Marta; Ruiz Fernández, Jesús; Valle Tejón, Beatriz
GROUP 3 (11): Fernández Jiménez, Diego; Fernández Villabrille, Sara; García Vega, Jorge; González Ingelmo, María; González Tolivia, María Esther; Guerra García, María; Gutiérrez Fernández, Sara; Matesanz Sánchez, Rocío; Méndez Villalba, Laura; Pollino de Abia, Mónica; Rivero Peralta, Rodrigo
[Slide background: a nucleotide sequence used as decoration]
PRACTICALS IN BIOINFORMATICS APPLIED TO GENOMES
OBJECTIVE: Transform genetic sequences into useful information to reveal their structures, functions, mechanisms and roles in pathophysiology.
DATA: Use homolog sequences and chemical, clinical and literature data. Find, organize, evaluate and present the data and the results of comparative analyses.
INTERPRETATION: Infer and predict structures, interactions, functions, mechanisms and pathophysiological roles, based on analyses and models such as alignments, phylogenetic trees, pHMMs (profile hidden Markov models), 3D modeling and docking.
STRATEGY: Computational analysis of molecular evolution and phylogeny (trees) reveals the history and mechanisms of divergence of genes and species, their classification, dates and functional patterns.
INTEGRATION: Combine the results, incorporating external data (SNPs, expression, regulation, functional sites, properties, networks, phenotypes), into a written presentation and defend it.

ASSESSMENT of the PRACTICALS
1) Save and hand in the data files produced or analyzed (work in pairs): 1) search and selection of homologs; 2) multiple alignment(s); 3) phylogenetic trees; 4) pHMM and its logo; 5) 3D modeling (worth 50%).
2) Written summary of 10-15 pages with Introduction (NCBI Gene, PubMed literature, OMIM, etc.), objective, annotation and interpretation of the results (above) with citations and bibliography; in some cases, a 15-min oral presentation of the same material (worth 25%).
3) Exam on theory and practicals covering the main topics of genomics and bioinformatics (last session, 1 h, May 2017; worth 25%).
GENETICS
chromosomal DNA; genes as functional units
transcription into RNAs, some of them translated into proteins
modified by mutation, indels, splicing
the differences define variations and phenotypes.
GENOMICS
inherited DNA that defines an organism
the data take the form of sequences
modified by mutation, indels, rearrangement, methylation
the differences define the individual.
regulation by TFs, ncRNAs, 3D chromatin
EPIGENOMICS
inheritance acquired at the chromatin level (TFs, ncRNAs, histones, 3D)
the data identify the nucleotides/aa and their modifications
modified by methylation and/or acetylation enzymes
the changes alter the regulation of genome expression.
PROTEOMICS
chemical studies of proteins and their isoforms
the data are post-translational modifications
mediated by enzymes (transferases, etc.)
the changes affect interactions and efficiency.
BIOINFORMATICS
computational analysis to organize and interpret information
the data include literature, sequences, chemical and clinical data, etc.
tools: data, software, computers, internet
the results count, predict, explain, diagnose and solve.
[Figure labels: Omics, Epigenome, Phenome. Adapted from http://www.sciencebasedmedicine.org]
Image sources:
http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif
Regions of the gene outside of the CDS are called UTRs (untranslated regions), and are mostly
ignored by gene finders, though they are important for regulatory functions.
Richards (2011)
The Human Genome Guide
Figs. 12.11
lncRNA lincRNA
Small nuclear ribonucleic acid (snRNA): RNA splicing (removal of introns from hnRNA), regulation of
transcription factors or RNA polymerase II, and maintaining the telomeres. They are always associated with
specific proteins, and the complexes are referred to as small nuclear ribonucleoproteins (snRNP) often
pronounced "snurps". These elements are rich in uridine content.
Small nucleolar RNAs (snoRNAs). Small RNA molecules that play an essential role in RNA biogenesis and
guide chemical modifications of ribosomal RNAs (rRNAs) and other RNA genes (tRNA and snRNAs).
MicroRNAs (miRNAs) Post-transcriptional regulators that bind complementary sequences on target mRNA
transcripts (mRNAs), usually resulting in translational repression or target degradation and gene silencing.
Small interfering RNA (siRNA), short interfering RNA or silencing RNA, is a class of double-stranded RNA
molecules, 20-25 nucleotides in length, that play a role in the RNA interference (RNAi) pathway, where it
interferes with the expression of a specific gene.
Long noncoding RNAs (lncRNAs): non-protein-coding transcripts >200 nt, some intergenic (lincRNAs).
THE EPIGENETIC CODE (reversible modifications of DNA and histones)
DNA methylation at CpG islands in promoters is associated with gene silencing.
The MARKS are recognized by READERS and ERASERS: chromatin protein complexes responsible for structure and function, independent of the specific sequence signals of the DNA.
Writers: e.g. Ac/Me transferases
Readers
Erasers: e.g. demethylases, deacetylases
[Figure labels: Writers, Marks (Me), Specific signals, DNA, CpG density, Gene, Histone H3 Lysine 9 (= H3K9)]
Homologue Relationships
Orthologues: any pairwise gene relation where the ancestor node is a speciation event.
Paralogues: any pairwise gene relation where the ancestor node is a duplication event.
[Figure labels: Molecular profiling (pHMM); Docking, Consurf, Dal; CONSURF, CHIMERA; SDP-Pred; DIVERGE; K-estimator; Computational biology; Functional information in 3D context; 3D: structural models (tracing subfamily conservation, divergence, rates and patterns)]
Introduction to Sequence Analysis: Substitution matrices
The empirical frequency with which amino acid type i is replaced by type j (or vice versa) is written as M_{i,j} in the matrix. Alignment scores add up over positions: the score of aligning two Ys in an alignment YY/YY is 10+10=20, a very significant score, whereas that of YY/TP is 0-5=-5.

1. HOMOLOG SEARCH
Search NCBI-Gene for the reference species and the longest or most common isoform, and download the FASTA.
a) Use the EBI-FASTA program with UniProt to obtain pairwise sequence scores.
b) NCBI-BLASTP or PSI-BLAST with the NR database searches for conserved domains.
c) PHMMER and JACKHMMER use a protein, an alignment or an HMM to search for proteins with a complete profile, and can find more distant homologs in UniProt.
Always request up to 1000+ results (without changing other parameters) and tick the boxes of 150-200 well-dispersed homologs (orthologs, paralogs, outgroup).
Save the well-distributed selection of 150-200 in complete FASTA format, checking the coverage, identity/similarity, high score, E-value and the alignments at the bottom.
Short names can be entered in the FASTA to avoid conflicts with other programs. Also save the results page (the graphic), statistics and alignments as text with "reformat results", to review them if needed.
All the data can also be saved in complete FASTA and GenBank formats to keep all the annotation information.
FASTA file format:
>NAME Description (line break)
Sequence (aa or nt)
>TYR-Hsa gi|4507753|ref|NP_000363.1| tyrosinase precursor [Homo sapiens]
ACDEFGHI ...
Include orthologs and paralogs as homologs, and some distant species that serve as outgroup to root the phylogenetic trees.
Consult EBI-PFAM, Superfam and Sanger-Ensembl to see the complete family.
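The FASTA layout described above (a ">" header line with a short name and a description, followed by sequence lines) can be read with a few lines of Python. This is a minimal sketch, not part of the original practical: the function name parse_fasta and the second record are hypothetical, and the residues are placeholders rather than real sequences.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, description, sequence) tuples."""
    records = []
    name, desc, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, desc, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, desc, "".join(chunks)))
    return records


# Placeholder residues, not the real tyrosinase sequence.
example = """>TYR-Hsa gi|4507753|ref|NP_000363.1| tyrosinase precursor [Homo sapiens]
ACDEFGHI
KLMNPQRS
>TOY-2 hypothetical second record
ACDE
"""
records = parse_fasta(example)
for name, desc, seq in records:
    print(name, len(seq), desc)
```

Keeping the short name as the first word of the header is what lets downstream programs use it as the sequence identifier.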
Alignment score statistics (applies to ungapped alignments):
$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$
E = expected number of random hits
S = your score
K = scale for search space
$\lambda$ = scale for scoring system
$S'$ = bit score = $(\lambda S - \ln K)/\ln 2$
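The two forms of the E-value are equivalent, which a short calculation can confirm. The sketch below assumes that m and n are the query length and the database length, and the values of lambda, K, m, n and the raw score are illustrative assumptions rather than figures from the slides.

```python
import math


def bit_score(raw_score, lam, K):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative numbers only (assumed, not taken from the course material).
lam, K = 0.318, 0.134
m, n = 300, 5_000_000
raw = 120
bits = bit_score(raw, lam, K)
print(bits, expect_from_raw(raw, m, n, lam, K), expect_from_bits(bits, m, n))
```

Because the bit score already absorbs lambda and K, bit scores from searches with different scoring systems can be compared directly.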
BLAST Output: Domains, Graphics, Selectable Descriptions with Scores, Pairwise Alignments
BLASTP: simple pairwise comparison of 2 protein sequences in FASTA format
TBLASTN: pairwise comparison of MAPT.PRO vs human chromosomes
PSI-BLASTP search at http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome with ANXA10.PRO (FASTA format)
1. Graphics Summary
2. Descriptions: output can be reformatted prior to export
3. Alignments: use pairwise alignments to evaluate/select results
NCBI - Databases
NCBI-Gene: start here and follow the links to sequences, maps, etc.
MAPT: gene linkage, exon organization, alt. spliced transcripts, microtubule-binding domains
http://www.ncbi.nlm.nih.gov/snp/
http://www.ncbi.nlm.nih.gov/variation/
http://www.ncbi.nlm.nih.gov/omim/
NCBI BLAST: check the versions, graphics, descriptors and their parameters, and alignments.
Save the results; download all the sequences (FASTA and GenBank) and the selected ones.
1. Graphical summary
2. Names, codes, descriptions
3. Alignments
4. Selection of putative homologs: broad, diverse, orthologs and paralogs, outgroup

Profile HMMs (profile hidden Markov models)
Markov chain: models the state of a system with a random variable that changes through time, dependent on the previous state.
Hidden Markov model: a Markov chain for which the state is only partially observable (e.g. MSA position lacking some aa).
Markov decision process: a Markov chain in which state transitions depend on the current state and an action vector that is applied to the system.

Subfamily HMMs (Brown et al., Subfamily HMMs in functional genomics (2005) Pacific Symposium on Biocomputing)
1. Estimate Dirichlet mixture density posterior for each subfamily at each position separately.
2. Use Dirichlet density posteriors to weight contributions from other subfamilies.
3. Compute amino acid distribution using weighted counts and standard Dirichlet procedure.
Subfamily HMMs increase the separation between true and false positives.
515 unique SCOP folds; PFAM full MSAs; scored against Astral PDB90; 1.5% error rate in subfamily classification using top-scoring SHMM.

Some important concepts
Sequence comparison methods for homology searching have emerged, such as patterns, profiles (an aligned set of sequences containing a domain) and HMMs (statistical models of the primary structure of the sequences).
Motif: if we look at a multiple alignment of homologous proteins, we see that some columns vary considerably while others are more conserved. When several neighboring columns are highly conserved, that is, when short stretches of the sequences are more conserved than others and could characterize the proteins functionally, we usually speak of MOTIFS.
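The idea of a motif as a run of neighboring, highly conserved columns can be illustrated with a small calculation. This is a minimal sketch under stated assumptions: conservation is measured as the fraction of sequences carrying the most common residue, the 0.8 threshold and minimum run length of 3 are arbitrary choices, and the toy alignment is invented for illustration.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # gaps are not residues
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_runs(scores, threshold=0.8, min_len=3):
    """Return (start, end) half-open column ranges of consecutive conserved columns."""
    runs, start = [], None
    for i, score in enumerate(scores + [0.0]):  # trailing sentinel closes the last run
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Invented toy alignment, not real protein data.
toy = ["MKTDEAGL", "MRTDEAGV", "LKTDEAGI", "MKSDEAG-"]
scores = column_conservation(toy)
print(conserved_runs(scores))
```

Real analyses usually weight sequences and use information content rather than a simple majority fraction; the point here is only the link between column conservation and motifs.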
One difference between profiles and regular expressions or patterns is that a profile is not limited to small regions with a high degree of similarity; it is more useful for defining longer regions or domains that characterize protein families rather than motifs. A profile can cover both conserved and variable regions of the alignment.
Introduction
Hidden Markov models (HMMs) emerged as a tool for speech processing: a statistical model that, through a learning algorithm, extracted the main stochastic characteristics of a speech signal.
The huge amount of data coming from the sequencing of different genomes raises a related problem: how to extract the underlying information from these data.
Solution: HMMs.

Some important concepts
3. Profile HMMs: presented as a more sensitive way than regular-expression patterns and conventional profiles of searching for remote homologs and conserved domains, based on a statistical description of the consensus primary structure of a protein family.
In the HMM that we are going to analyze we consider three possible states, corresponding to the probability of finding a given residue at that position, the probability of insertion and the probability of deletion.
Find the parameters $\theta = (e_i(\cdot), a_{ij})$ that maximize $P[x \mid \theta]$.

Let $V_k(i) = \max_{\pi_1,\dots,\pi_{i-1}} P[x_1 \dots x_{i-1}, \pi_1, \dots, \pi_{i-1}, x_i, \pi_i = k]$
= probability of the most likely state sequence ending in state $\pi_i = k$.

VITERBI
Initialization: $V_0(0) = 1$; $V_k(0) = 0$ for all $k > 0$
Iteration: $V_j(i) = e_j(x_i) \max_k V_k(i-1)\, a_{kj}$
Termination: $P(x, \pi^*) = \max_k V_k(N)$

FORWARD
Initialization: $f_0(0) = 1$; $f_k(0) = 0$ for all $k > 0$
Iteration: $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
Termination: $P(x) = \sum_k f_k(N)\, a_{k0}$

It is similar to aligning a set of states to a sequence.
Time complexity: $O(K^2 N)$; space complexity: $O(KN)$; K = number of states, N = sequence length.
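The Viterbi and forward recursions above differ only in replacing the maximum by a sum, which the following sketch makes explicit. It is a simplified illustration, not the course's implementation: it uses an initial state distribution instead of an explicit begin state 0, omits the end transition a_k0, and works with plain probabilities (real tools use logarithms to avoid underflow). The two-state toy model and all its numbers are assumptions.

```python
def viterbi(obs, states, start, trans, emit):
    """Return (probability, path) of the most likely state path for obs."""
    # Initialization: V_k(1) = start_k * e_k(x_1)
    V = [{k: start[k] * emit[k][obs[0]] for k in states}]
    back = []
    # Iteration: V_j(i) = e_j(x_i) * max_k V_k(i-1) * a_kj
    for x in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for j in states:
            best = max(states, key=lambda k: prev[k] * trans[k][j])
            col[j] = emit[j][x] * prev[best] * trans[best][j]
            ptr[j] = best
        V.append(col)
        back.append(ptr)
    # Termination and traceback of the best path
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return V[-1][last], path[::-1]


def forward(obs, states, start, trans, emit):
    """Return P(obs), summed over all state paths."""
    f = {k: start[k] * emit[k][obs[0]] for k in states}
    # Iteration: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl
    for x in obs[1:]:
        f = {l: emit[l][x] * sum(f[k] * trans[k][l] for k in states) for l in states}
    return sum(f.values())


# Hypothetical two-state model: "C" (conserved, mostly A) and "V" (variable).
states = ("C", "V")
start = {"C": 0.5, "V": 0.5}
trans = {"C": {"C": 0.8, "V": 0.2}, "V": {"C": 0.3, "V": 0.7}}
emit = {"C": {"A": 0.9, "G": 0.1}, "V": {"A": 0.5, "G": 0.5}}
obs = "AAGGA"

prob, path = viterbi(obs, states, start, trans, emit)
print(path, prob, forward(obs, states, start, trans, emit))
```

Both functions visit every pair of states at every position, which is where the O(K^2 N) time complexity comes from.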
Training algorithms
We have a set of example sequences of the kind we want the model to fit (training sequences), which we assume to be independent.
If we knew the path of states that the model followed, the states would not be hidden (the HMM becomes a Markov chain), and the maximum-likelihood estimators of the emission and transition frequencies are obtained from the observed frequencies.
If we have (biological or physical) information that provides prior knowledge about the probability distribution, we can add it to the model as pseudocounts.

Training algorithms
Objective: given a sequence of observations, find the most probable model that generates that sequence.
Problem: we do not know the relative frequencies of the hidden states visited.
No analytical solutions are known.
We approach the solution by successive approximations.
The problem now is one of optimization, so many heuristics can be used (simulated annealing, genetic algorithms, etc.).
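When the state path is known, the estimation described above reduces to counting and normalizing, with pseudocounts added before normalization. The sketch below illustrates this under assumptions of our own: the function name estimate_hmm, the two states "M" and "I", the two-letter alphabet and the labelled training data are all hypothetical.

```python
def estimate_hmm(labelled, states, alphabet, pseudocount=1.0):
    """Estimate transition and emission probabilities from (symbols, path) pairs."""
    # Every count starts at the pseudocount, so unseen events keep some probability.
    trans_counts = {k: {l: pseudocount for l in states} for k in states}
    emit_counts = {k: {b: pseudocount for b in alphabet} for k in states}
    for symbols, path in labelled:
        for i, (x, k) in enumerate(zip(symbols, path)):
            emit_counts[k][x] += 1
            if i > 0:
                trans_counts[path[i - 1]][k] += 1

    def normalise(table):
        return {k: {key: v / sum(row.values()) for key, v in row.items()}
                for k, row in table.items()}

    return normalise(trans_counts), normalise(emit_counts)


# Hypothetical training data: each symbol is labelled with the state that emitted it.
training = [("AAGG", "MMII"), ("AGGA", "MIIM")]
trans, emit = estimate_hmm(training, states="MI", alphabet="AG")
print(trans)
print(emit)
```

With hidden paths this counting step is no longer possible directly, which is why iterative approximation is needed in that case.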
Profile HMMs
Figure 2 shows an HMM for an alignment of four sequences with three positions.
An HMM is composed of a series of nodes or states, each of which emits symbols (one of 4 nucleotides or 20 amino acids) with a given probability.
The states are connected sequentially, with transition probabilities between them. There are also insertion and deletion probabilities.

SOFTWARE FOR PROFILE HMMs
Several software packages are available for implementing profile HMMs.
The main difference between them is the architecture they adopt:
BLOCKS and META-MEME represent the motif models, the classic HMMs.
HMMER2 Plan7 and profile HMM represent the new generation of profile HMMs in SAM, HMMER and PFTOOLS.

The author distinguishes two kinds of model:
Profile models: models with insertion and deletion states associated with each match state, allowing insertions and deletions in the selected (target) sequence.
Motif models: models dominated by chains of match states (modeling ungapped blocks of consensus sequence), separated by a small number of insert states that model the spacers between the ungapped blocks.
SAM, HMMER, PFTOOLS and HMMpro implement models based at least in part on the original profile HMMs of Krogh (1994).
These packages are augmented with a simple model that deals with multiple domains, aligned sequences and local alignments.
Whether the alignment is local or global is not necessarily essential to the algorithm; rather, this shows that the probabilistic treatment is part of the model architecture.

SAM and HMMER
Use Dirichlet mixtures for many distributions to help with the number of free parameters. This is accentuated if hybrid HMM/neural network techniques are adopted.
HMMER and PFTOOLS
Used primarily to build searchable databases of models in which the alignments are present.
PROBE, META-MEME and BLOCKS
Assume different motif models: the alignments consist of one or more ungapped blocks, separated by intervening sequences that are assumed to be random. PROBE and META-MEME adopt probabilistic models for the gaps.
GENEWISE
A sophisticated window-based search application that can take a HMMER protein model as input.
PSI-BLAST
Not an HMM application, but it uses the principles of probabilistic models to build HMM-like models from multiple alignments.

LIBRARIES OF PROFILE HMMs
Profile HMM software is good for:
Modeling a particular sequence family of interest.
Searching a database for homologous sequences.
We now also need to search a single sequence against a library of profile HMMs.
Building a library requires a large number of multiple alignments of common domains.
[…] profile HMMs, a second set of these can be provided: solid, sensible, statistically based analysis tools that complement BLAST and FASTA analyses.
4. Export the consensus of the sequences (majority aa) from the HMM (the -c option requests the majority-rule consensus):
HMMEMIT -c -o PROTOUT.TXT PROTNAME.HMM
HMMER server: Janelia Farm (HHMI) http://hmmer.janelia.org/, now at EBI
Permits rigorous search using protein sequence or HMM vs curated databases.
Results by taxonomy (species domains), various formats, plus sequence logos.
HMMSEARCH vs UniProt: > 3000 results and taxonomy
No Hummer!
[…] sequence database.
jackhmmer - iteratively search a query protein sequence, multiple sequence alignment or profile HMM against the target protein sequence database.
hmmfetch - retrieves one or more profile HMMs from a profile database (e.g. […]
Applications: Gene identification, validation; Multiple alignment; Structure comparison, modelling; Therapeutics, drug discovery
[Figure: residue at one aligned position across annexins ANXA11, ANXA2, ANXA3, ANXA4, ANXA5, ANXA6a, ANXA6b and ANXA7 (T in most, V in ANXA2, S in ANXA6a), with table rows "22Leu 1.03e+00 7.08" and "187Leu 7.04e-01 6.24"; label LBD]
pHMM versatility:
- Searches aa-nt
- Increase the separation between true and false positives
- Sequence logos
- Powerful modeling tool, versatile input of training data
- Modular for combining HMMs, makes data visually informative
- Incorporates prior knowledge
- Fast, now comparable to BLAST searches
- Markov chains supposed to be independent but 1 site may affect neighboring site
- Convergence to true optimum may require a minimal number
- Quality of MSA determines accuracy and utility.

# HMMSEARCH :: search profile(s) against a sequence database
# HMMER 3.1b1 (May 2013); http://hmmer.org/
#------------------------------------
# query HMM file:              /cygdrive/d/seq/anxhmm/ANXFDOM84.hmm
# target sequence database:    AnxBacteria139.lib
# output directed to file:     AnxBacteria139_ANXFDOM84_out.TXT

    --- full sequence ---   --- best 1 domain ---   -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence  Description
    ------- ------ -----    ------- ------ -----   ---- --  --------  -----------
    1.4e-67  216.5  30.7    4.9e-24   77.4   0.5    6.4  5  gi|380728269|gb|AFE04271.1|  hypothetical protein COCOR_01751
    4.9e-65  208.4  23.1      2e-21   69.0   0.2    6.0  5  gi|441490447|gb|AGC47142.1|  hypothetical protein MYSTI_05866
    1.4e-51  165.3  16.0    4.9e-20   64.6   1.3    6.7  5  gi|521998093|ref|WP_020509364.1|  hypothetical protein [Actinoplan
    3.4e-40  129.0  26.7    4.5e-16   51.9   0.1    7.0  5  gi|380729140|gb|AFE05142.1|  hypothetical protein COCOR_03294
    2.7e-27   87.8   7.9    3.5e-16   52.3   0.1    3.2  2  gi|262078316|gb|ACY14285.1|  Annexin repeat protein [Haliangi
    6.6e-21   67.4   0.4    1.4e-13   44.0   0.0    2.3  2  gi|406982961|gb|EKE04220.1|  hypothetical protein ACD_20C0009
    1.3e-20   66.4   0.9    5.7e-19   61.2   0.2    2.6  2  gi|262077805|gb|ACY13774.1|  Annexin repeat protein [Haliangi
    3.7e-20   65.0   0.6    5.3e-18   58.1   0.2    3.2  2  gi|262079386|gb|ACY15355.1|  hypothetical protein Hoch_2831 [
    3.9e-19   61.7   0.4    1.2e-18   60.2   0.4    1.9  1  gi|262079485|gb|ACY15454.1|  Annexin repeat protein [Haliangi
    9.3e-18   57.3   0.1    1.6e-17   56.5   0.1    1.4  1  gi|516340685|ref|WP_017730718.1|  hypothetical protein [Nafulsella
    2.4e-16   52.8   0.4    5.4e-16   51.7   0.1    1.7  1  gi|497868487|ref|WP_010182643.1|  hypothetical protein [Aquimarina
    2.1e-14   46.6   0.1    4.3e-14   45.6   0.0    1.6  1  gi|497867219|ref|WP_010181375.1|  hypothetical protein [Aquimarina
    4.1e-10   32.9   0.0    7.2e-10   32.1   0.0    1.4  1  gi|502689809|ref|WP_012925298.1|  hypothetical protein [Spirosoma

Domain annotation for each sequence (and alignments):
>> gi|380728269|gb|AFE04271.1| hypothetical protein COCOR_01751 [Corallococcus coralloides DSM 2259]
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   77.4   0.5   2.4e-24   4.9e-24       5      67 ..     110     172 ..     106     173 .. 0.95

  == domain 1  score: 77.4 bits;  conditional E-value: 2.4e-24
                    ANXFDOM84   5 klweavdglgtdEdavlkvlrgltpeqiaavakaYqkrYgkdlgddlkselsgdelkralell 67
                                  +++++++g+gtdEd+++k+l+g+tpeqia+++++Yq++Ygk+l +++++el g++l+ra ll
  gi|380728269|gb|AFE04271.1| 110 AMDGGMTGWGTDEDKIFKTLEGKTPEQIAMIRQSYQDHYGKNLDEKIRDELGGSDLQRAEGLL 172
                                  6889*******************************************************9987 PP
Methods of reconstruction by type of data
                          Distances                  Discrete characters
Clustering algorithms     UPGMA, Neighbor-Joining
Optimization criterion    Minimum Evolution          Maximum Parsimony, Maximum Likelihood
Bayesian                                             Bayesian approaches

Traditional approaches
Neighbour-joining algorithm
Tree searches with optimality criterion: Maximum parsimony, Maximum likelihood
Bayesian approaches

How to root? Using outgroups.
[Figure: cladogram and phylogram of taxa A, B, C, D with branches numbered 1-5]
MrBayes 3.1: some options
MrBayes will produce a population of trees and parameter values, obtained by a Markov chain (MCMCMC, Metropolis-coupled Markov chain Monte Carlo). If the chain is working well, these will have converged to probable values.
In practice we plot the results of an MCMCMC run to determine the region of the chain that converged to probable values. The burn-in is the region of the chain that is ignored for calculation of the consensus tree.
Trees and parameter values from the region of equilibrium are used to estimate a consensus tree.
The proportion of trees recovering a given clade corresponds to the posterior for that clade, the probability that this clade exists.
The MCMCMC uses the lnL (log likelihood) function to compare trees between generations.
Posterior support values for a given dataset and method are typically higher than bootstrap and puzzle support values.
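The relation between the sampled trees and clade posteriors described above can be shown with a toy calculation. This sketch is an illustration only and does not use MrBayes: each sampled tree is represented simply as a collection of clades (frozensets of taxon names), the sample is invented, and the 25% burn-in fraction is an assumed setting.

```python
from collections import Counter


def clade_posteriors(sampled_trees, burnin_fraction=0.25):
    """Fraction of post-burn-in trees that contain each clade."""
    start = int(len(sampled_trees) * burnin_fraction)
    kept = sampled_trees[start:]  # discard the burn-in region of the chain
    if not kept:
        raise ValueError("no trees left after discarding the burn-in")
    counts = Counter()
    for tree in kept:
        counts.update(set(tree))  # count each clade at most once per tree
    return {clade: n / len(kept) for clade, n in counts.items()}


# Invented sample: three topologies described by their clades.
AB, CD, AC, BD, ABC = (frozenset(s) for s in ("AB", "CD", "AC", "BD", "ABC"))
t1, t2, t3 = [AB, CD], [AB, ABC], [AC, BD]
samples = [t3, t3, t1, t1, t2, t1, t1, t2]

posteriors = clade_posteriors(samples)
for clade, p in sorted(posteriors.items(), key=lambda item: -item[1]):
    print("".join(sorted(clade)), round(p, 2))
```

Clades seen only during the burn-in (here AC and BD) receive no support, which is the reason the burn-in has to be chosen from a plot of the chain rather than guessed.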
Parsimony
In contrast to distance-based approaches, parsimony and ML map the history of gene sequences onto a tree.
In parsimony, the score is simply the minimum number of mutations that could possibly produce the data.

Selection of (statistically) best-fit models of evolution
ProtTest3 http://darwin.uvigo.es/software/prottest3/
AIC (Akaike Information Criterion): simple relationship between the likelihood and the number of parameters to estimate the distance of a model from truth.
BIC (Bayesian Information Criterion): includes a penalty for the number of parameters to avoid overfitting of the selected model.
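Both criteria combine the log-likelihood of a fitted model with its number of free parameters, and the model with the lowest value is preferred. The sketch below uses the standard definitions AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL; the model names, log-likelihoods, parameter counts and alignment length are invented for illustration and are not ProtTest3 output.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 lnL."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: k ln(n) - 2 lnL."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model -> (lnL, number of free parameters).
fits = {
    "JTT": (-12050.0, 57),
    "JTT+G": (-11890.0, 58),
    "LG+G": (-11875.5, 58),
}
n_sites = 300  # assumed alignment length

best_aic = min(fits, key=lambda m: aic(*fits[m]))
best_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
print(best_aic, best_bic)
```

Because ln(n) exceeds 2 for any alignment longer than about 7 sites, BIC penalizes extra parameters more heavily than AIC in practice.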
Maximum likelihood (ML)
In ML, a hypothesis is judged by how well it predicts the observed data; the tree that has the highest probability of producing the observed sequences is preferred.
To use this approach, we must be able to calculate the probability of a data set given a phylogenetic tree. This requires a model of sequence evolution that describes the relative probability of various events.
These probabilities take into account the possibility of unseen events.
From many perspectives, ML is the most appealing way to estimate phylogenies. All possible mutational pathways that are compatible with the data are considered and the likelihood function is known to be a consistent and powerful basis for statistical inference in general.

Neighbour-joining (distance-based)
extremely popular, relatively fast.
performs well when the divergence between sequences is low.
good when evolutionary rates vary. Proven to construct the correct tree […]
The first step in the algorithm is converting the DNA or protein sequences into a distance matrix that represents the evolutionary distance between sequences.

           1        2        3        4        5
1 H959     -
2 H3847    0.00752  -
3 H5539    0.00809  0.01069  -
4 H1067    0.00681  0.01593  0.01547  -
5 H3368    0.00855  0.01126  0.01706  0.01505  -

A serious weakness of distance methods is that the observed differences between sequences are not accurate reflections of the evolutionary distances between them, because of multiple substitutions (saturation of changes over time).
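The gap between observed differences and evolutionary distance can be made concrete with the Jukes-Cantor (JC69) correction for nucleotide sequences, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The sketch below is an illustration with invented sequences; choosing JC69 is an assumption here, and real analyses would select the model as discussed under model selection.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites, ignoring positions with gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor corrected distance; infinite once the sites are saturated."""
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1 - 4 * p / 3)


# Invented aligned sequences.
a = "ACGTACGT"
b = "ACGTACGA"
p = p_distance(a, b)
print(p, jc69_distance(p))
for observed in (0.05, 0.3, 0.6):
    print(observed, round(jc69_distance(observed), 3))
```

The correction is small for closely related sequences and grows rapidly as p approaches 0.75, which is the saturation effect mentioned above.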
[Figure: probability density plot; labels: Principle, Alpha parameter, Scaling factor]
Search strategies: rarely exhaustive, mostly heuristic
NNI (Nearest neighbor interchanges)
TBR (Tree bisection-reconnection)
SPR (Subtree pruning and regrafting)
Branch support (aLRT)
[…] really exist in the true tree. It is much faster than bootstrapping.
http://www.phylogeny.fr
1. aLRT
2. Chi2: parametric branch support
3. aLRT-SH: non-parametric branch support based on a Shimodaira-Hasegawa-like procedure
4. aLRT Chi2 and SH: calculates parametric and non-parametric branch support; result is the minimum support of both methods

Programs: Ugene-MrBayes; PhyloBayes (download, run parallel chains)

Bayesian inference (John Huelsenbeck)
Search for the tree that maximizes the chance of seeing the tree given the data (P(Tree | Data)).
[Figure: prior distribution and posterior distribution of probability (up to 1.0) over tree topologies 1, 2 and 3, updated by the data (observations)]
(MC)3: swap of states

Bootstrapping and MCMC generate a sample of trees
Note that MCMC yields a much larger sample of trees in the
same computational time, because it produces one tree for
every proposal cycle versus one tree per tree search in the
traditional approach.
However, the sample of trees produced by MCMC is highly
auto-correlated.
As a result, millions of cycles through MCMC are usually
required, whereas many fewer (of the order of 1,000) bootstrap
replicates are sufficient for most problems.
Bayesian methods are exciting because they allow complex models of sequence evolution to be implemented, with applications such as:
estimating divergence times
finding the residues that are important to natural selection
detecting recombination points
[Figure: tree of annexin domain architectures with numeric labels 57, 32, 77, 53: ANX3'A6 (36), ANXA10 (13), ANX5'A6 (36), ANXA5 (38); KGD label]
[…] includes both their common ancestor and all their descendants.
[…] selected by the user to be used in a study, such as individuals, populations, species, genera, or bacterial strains.
Use homologies, not analogies!
- Homology: common ancestry of two or more character states
[Figure labels: Frog gene 2; Species D]
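Phylogenetic trees
How to root? Using outgroups.
[Figure: tree of taxa A, B, C, D with branches numbered 1-5; labels: Branch, Node, Root, Ingroup]

Number of possible trees for n taxa:
rooted: $(2n-3)! / (2^{n-2}(n-2)!)$
unrooted: $(2n-5)! / (2^{n-3}(n-3)!)$

Taxa (n)   rooted   unrooted
2          1        1
3          3        1
4          15       3
5          105      15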
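The counts in the table follow directly from the two formulas, and evaluating them for larger n shows why tree searches are rarely exhaustive. This is a minimal sketch; the function names are our own, and the cases n = 2 (and smaller) are handled separately because the unrooted formula is not defined there.

```python
from math import factorial


def n_rooted(n):
    """Number of rooted bifurcating trees: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        return 1
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def n_unrooted(n):
    """Number of unrooted bifurcating trees: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 3:
        return 1
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (2, 3, 4, 5, 10, 20):
    print(n, n_rooted(n), n_unrooted(n))
```

The number of rooted trees for n taxa equals the number of unrooted trees for n + 1 taxa, since rooting is equivalent to attaching one extra branch.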
Bayesian inference provides probabilities of support for clades (posterior probabilities).
Gamma-distributed rates across sites (shape parameter alpha):
Infinitely large alpha: the rate is the same for all sites (no rate variation)
alpha = 1: extensive rate variation
alpha < 1: many (nearly) invariable sites
[Figure: probability density vs. relative evolutionary rate. Source: http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Gamma_distribution_pdf.png]
Key residue prediction using subfamily and family-wide conservation analysis
[Figure: 3D structure with key residues labelled Y221, W222, D558, R627, D628, G629, Y743, A744, H745; motif letters D RD E YAH. Parker JS, Roe SM, Barford D., EMBO J., 2004; Tanaka Hall, T., Structure 2005]
Discover bacterial annexins by pHMM search, alignment, tree, 3D modeling and analysis in Chimera
SPECIAL TRICKS:
1. Incorporate electric potential data, using the web server for the Adaptive
Poisson-Boltzmann Solver (APBS): http://nbcr-222.ucsd.edu/pdb2pqr_2.0.0/
Enter a PDB code and select force-field parameters to process .pdb to files
PDB.pqr, PDB.propka and PDB.in. Continue by launching APBS to calculate
values in PDB.pqr to PDB-pot-PE0.dx(.gz) for use as input to Chimera.
In Biology, nothing makes sense except in the light of Evolution. (Theodosius Dobzhansky, Russian geneticist)
It is not necessary to accept everything as true, one must only accept it as necessary.
[Figure panels: ANXA9 KGD+hydrophobicity; ANXA10 Cys+hydrophobicity; Cytophaga ANXF hydrophobicity; ANXA9 electrostatic potential; ANXA10 electrostatic potential; Cytophaga ANXF electrostatics]
Project genes:
LCT (1927 aa) | Lactase | Persistence / deficiency; Neandertal, Denisovan, Clovis
POMC (267 aa) | Enkephalin, Endorphin, Dynorphin | Opiates
HBB (147 aa) | beta hemoglobin, HbS, β-thalassemia, etc. | Variants in O2 affinity; globins: icefish, crocodile
CHRNA5 (468 aa) | Neuronal Acetylcholine Receptor, Nicotinic, Alpha 5 | Contribution to addiction
MAP-Tau (< 758 aa) | Neurodegeneration | Misfolded and aggregated protein in central neurons
STH (128 aa) | Saitohin | Gene within the MAPT gene
Taq-Pol (834 aa) | DNA polymerase I (polA) [Thermus thermophilus HB27] | Extremophile enzyme
Trefoil TFF1,2,3 (84, 129, 80 aa) | Trefoil factors, mucus | Role of the cysteines in the mucosa

Project topics:
3 | MAPT locus | Neurodegenerative Diseases | Human | Itsara-2012 | KANSL1, MAPT-link | Microdeletion, duplication
4 | Triplet expansion | Huntington-coding CAG polyglutamine | Human | Marmolino-2012 | HTT 3144 aa; Friedreich's ataxia FXN 210 aa (intronic GAA) | Etiology and consequences, alternative mechanisms
5 | Viral receptors | Invasion, proliferation | Human | Longdon-2014 | BST (tetherin) | Interaction specificity
6 | Melanocortin signalling | Hair color, melanoma | Vertebrates | ASIP agouti 138 aa | Epigenetic basis
Non-PolyQ Diseases
Type | Gene | Codon | Normal/wildtype | Pathogenic
FRAXA (Fragile X syndrome) | FMR1, on the X-chromosome | CGG | 6 - 53 | 230+
FXTAS (Fragile X-associated tremor/ataxia syndrome) | FMR1, on the X-chromosome | CGG | 6 - 53 | 55-200
FRAXE (Fragile XE mental retardation) | AFF2 or FMR2, on the X-chromosome | GCC | 6 - 35 | 200+
FRDA (Friedreich's ataxia) 9q21 intron | FXN or X25 (frataxin) | GAA | 7 - 34 | 100+
DM1 (Myotonic dystrophy) | DMPK | CTG | 5 - 37 | 50+
SCA8 (Spinocerebellar ataxia Type 8) | OSCA or SCA8 | CTG | 16 - 37 | 110 - 250
SCA12 (Spinocerebellar ataxia Type 12) | PPP2R2B or SCA12 | nnn, on 5' end | 7 - 28 | 66 - 78
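The repeat counts in the table can be related to a sequence by finding the longest uninterrupted run of the repeat unit. The sketch below is a simplified illustration: the function names, the "intermediate" label for counts between the two ranges, and the toy sequence are our own assumptions, and the thresholds are copied from four rows of the table above.

```python
import re

# (repeat unit, upper end of normal range, lower end of pathogenic range) from the table.
RANGES = {
    "FRAXE": ("GCC", 35, 200),
    "FRDA": ("GAA", 34, 100),
    "DM1": ("CTG", 37, 50),
    "SCA8": ("CTG", 37, 110),
}


def longest_repeat_run(sequence, unit):
    """Largest number of consecutive copies of unit found in sequence."""
    pattern = re.compile("(?:%s)+" % re.escape(unit))
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


def classify(disease, count):
    """Compare a repeat count with the table ranges for one disease."""
    _unit, normal_max, pathogenic_min = RANGES[disease]
    if count <= normal_max:
        return "normal"
    if count >= pathogenic_min:
        return "pathogenic"
    return "intermediate"


# Toy sequence: 9 GAA repeats, an interruption, then 3 more.
toy = "TT" + "GAA" * 9 + "C" + "GAA" * 3 + "TT"
count = longest_repeat_run(toy, RANGES["FRDA"][0])
print(count, classify("FRDA", count))
```

Counting only uninterrupted runs is a deliberate simplification; interrupted repeats would need a more tolerant pattern.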
Information mining: Epigenetic targets and mechanisms; Imprinting center; Syndrome etiology,
pathogenesis;
Bioinformatic analyses:
1. Blast search and retrieve ca. 100 diverse homologs of SNRPN (Blast,Hmmer). Compile FASTA library file
(save GenBank annotations separately).
2. Align proteins (ClustalO, Muscle, Cobalt), manually revise (Ugene) and export in various formats (aln, phy,
meg, afa) from BUGACO.
3. Analyze alignment to produce pHMM (HMMER 3.1b) and create sequence logo (Skylign). Interpret
conservation pattern to infer site-specific functionality (SDP also).
4. Perform phylogenetic analyses (Mega-NJ, RAxML, Bayesian) with bootstrapping and gamma rates to
confirm homology, distinguishing orthologs from paralogs, estimate rates, assess coherence with known
speciation order.
5. Retrieve 3D structure (RCSB-PDB) to model in Chimera. Incorporate physicochemical and evolutionary information.
Fluorescent marine proteins in invertebrates and vertebrates