
EBT4-2017 Genomic Bioinformatics

Degree in Biotechnology (GBIOTE01) - Laboratory Practicals (30 h / group)

Computer room (Faculty of Biology, 1st floor)
Experimentation in Biotechnology IV
Universidad de Oviedo, Year 3, 2017

May-June 2017
Reginald Morgan (ESG-4.3)
morganreginald@uniovi.es
Tel. 98 510 4214

Timetable (calendar Mon-Sun, 1 May to 4 June): session blocks Bq1 12-14 h, Bq2 16-18 h, Bq3 09-11 h (10-12 h on some days). Presentations (PRESENT.) at the end of the course for G1, G2 and G3.

GROUP 1 (13), GROUP 2 (11), GROUP 3 (11)
Álvarez Freile, Jimena
Alonso Cordero, Andrés
Fernández Jiménez, Diego
Barrientos Álvarez, Octavio
Álvarez Alija, Nuria
Fernández Villabrille, Sara
Delgado Rodríguez, Jaime
Gallego Martínez, Borja
García Cantera, Marina
Álvarez González, Ana
Cano Menéndez, Mónica
Escudero Cernuda, Sara
García Vega, Jorge
González Ingelmo, María
González Tolivia, María Esther
Menéndez Fernández, Iris
Fernández Borbolla, Andrés
Guerra García, María
Pedrosa Laza, María
Peñarroya Rodríguez, Alfonso
Pérez Amieva, Patricia
Fernández Conty, Cristina Ana
Hériard Dubreuilh, Marine
Rodríguez Jardón, Marta
Gutiérrez Fernández, Sara
Matesanz Sánchez, Rocío
Méndez Villalba, Laura
Pérez Fuentes, Lucía
Ruiz Fernández, Jesús
Pollino de Abia, Mónica
Sánchez Fernández, Rosalía
Varela Fernández, Saray
Villaverde Marín, Marina
Valle Tejón, Beatriz
Rivero Peralta, Rodrigo

[Slide background: genomic DNA sequence]

BIOINFORMATICS PRACTICALS APPLIED TO GENOMES
OBJECTIVE: Transform genetic sequences into useful information to reveal their structures, functions, mechanisms and roles in pathophysiology.
DATA: Use homolog sequences, chemical and clinical data, and literature. Find, organize, evaluate and present the data and the results of comparative analyses.
INTERPRETATION: Infer and predict structures, interactions, functions, mechanisms and pathophysiological roles, based on analyses and models such as alignments, phylogenetic trees, pHMMs (profile hidden Markov models), and 3D modeling and docking.
STRATEGY: Computational analysis of molecular evolution and phylogeny (trees) reveals the history and mechanisms of divergence of genes and species, their classification, dates and functional patterns.
INTEGRATION: Combine the results with external data (SNPs, expression, regulation, functional sites, properties, networks, phenotypes) in a written presentation and defend it.

ASSESSMENT of the PRACTICALS
1) Save and submit the data files produced or analyzed (work in pairs): 1) homolog search and selection; 2) multiple alignment(s); 3) phylogenetic trees; 4) pHMM and its logo; 5) 3D modeling (worth 50%).
2) Written summary of 10-15 pages with Introduction (NCBI Gene, PubMed literature, OMIM, etc.), objective, annotation and interpretation of the results (above) with citations and bibliography; 15-min oral presentation of the same in some cases (worth 25%).
3) Exam on theory and practice covering the main topics of genomics and bioinformatics (last session, 1 h, May 2017; worth 25%).
GENETICS
Chromosomal DNA; genes are the functional units
transcribed into RNAs, some of which are translated into proteins
modified by mutation, indels, splicing
the differences define variations and phenotypes.
GENOMICS
Inherited DNA that defines an organism
the data take the form of sequences
modified by mutation, indels, rearrangement, methylation
the differences define the individual.
regulation by TFs, ncRNAs, 3D chromatin
EPIGENOMICS
acquired inheritance carried by chromatin (TFs, ncRNAs, histones, 3D)
the data identify the nucleotides/aa and their modifications
modified by methylation and/or acetylation enzymes
the changes alter the regulation of genome expression.
PROTEOMICS
chemical studies of proteins and their isoforms
the data are post-translational modifications
mediated by enzymes (transferases, etc.)
the changes affect interactions and efficiency.
BIOINFORMATICS
computational analysis to organize and interpret information
the data include literature, sequences, chemical and clinical data, etc.
tools: data, software, computers, internet
results count, predict, explain, diagnose, solve.
[Figure labels: Omics, Epigenome, Phenome] Adapted from http://www.sciencebasedmedicine.org
Image sources:
http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif

Different Facets of Genomic Analysis

STRUCTURAL GENOMICS - Identify and define (annotate) all physical features.
- Involves DNA sequencing technologies, high-throughput expression arrays,
- Utilizes genetic maps (cytogenetic, physical), orthologous-paralogous gene relationships, population variation (SNPs), genomic rearrangements by duplication or deletion, modification by methylation, acetylation, etc.
FUNCTIONAL GENOMICS
- Empirical studies compare control vs test sample, by sequence, function and phenotype. Uses NGS sequencing, RNAseq, genetic mapping, chips vs ChIP,
- Bioinformatics detects, infers, predicts functional elements & consequences.
COMPARATIVE GENOMICS - relate structural change to functional consequence
- Experimental or clinical analysis and assessment by pairs or populations.
- Bioinformatics (evolutionary conservation, phylogenetic trees, profile hidden Markov models, specificity determinants, SNPs)
EPIGENOMICS
- DNA methylation, histone modif., chromatin remodeling, histone code.
INTEGRATIVE GENOMICS
- Study of functional networks, expression regulation, protein-RNA interactions.
DISEASE GENOMICS - distinguish normal variation from pathological deviation.
- Population statistics, pharmacogenomics, chart chromosomal deletions, duplications, rearrangements.

What is Bioinformatics?

Definition: Conceptualizing biology in terms of molecules and then applying informatics techniques from math, computer science, and statistics to organize and understand the abundant, associated information.

Goals of bioinformatics
TECHNICAL bioinformatics: Develop informatics tools (software) to efficiently access and manage scientific databases of diverse information.
APPLIED bioinformatics: Analyze, model, interpret, infer and predict information relevant to deciphering DNA, RNA and protein sequences into biological structures, evolution, function and pathophysiological roles. E.g. identify genes, RNA intermediates and protein products, and determine their biological significance based on structure, function & evolution.
Reference textbooks & selected journal articles available (2006, 2009, 2011)
Special article series: NATURE REVIEWS GENETICS www.nature.com/reviews/genetics
Applications of next-generation sequencing (2009-2012); Genome-wide association studies; Regulatory elements; Non-coding RNAs; Translational genetics; Modes of transcriptional regulation.

SUMMARY of PRACTICALS
BIOINFORMATIC SEQUENCE ANALYSIS

Select a gene or pathology of interest from a list or of your own choosing.
Identify the name of the sequence and extract the complete human protein (GenPept and FASTA) and the cDNA (GenBank and FASTA) from NCBI Gene or NCBI-OMIM, to learn the structural regions and their organization.
Search for homologous proteins (150 orthologs, examples of the paralogs, and an outgroup outside these groups) with NCBI-BLASTP vs NR and with PHMMER vs UniProt (EBI server).
Create an alignment from the file of selected FASTA sequences using UGENE-CLUSTALO, MUSCLE, the T-COFFEE server and/or PROBCONS from Max Planck. Edit by eye.
Convert the cleaned alignment into a pHMM model (\cygwin\bin\HMMBUILD --informat afa --amino NOMBRE.HMM NOMBRE.afa) and then into a sequence logo (using HMMER v3.1b2 or the SKYLIGN server).
Based on the conservation pattern, infer the functional sites in the profile.
Reconstruct and interpret the phylogenetic trees in MEGA (NJ and maximum likelihood) with BOOTSTRAP and GAMMA RATES parameters. Run the ML on RAxML servers, or a BAYESIAN ANALYSIS at phylogeny.fr with MrBayes or PHYLOBAYES (<30 taxa).
Find (PDB) or build (I-TASSER) the 3D structure; visualize and model it with CHIMERA, highlighting pathogenic SNPs, conserved motifs, hydrophobic residues, charges (ElecPot) and conservation (CONSURF).
Submit the Results files, the report in WORD, and the presentation (if selected) in PowerPoint.
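As a small illustration of the pHMM step above, the following sketch only assembles the hmmbuild argument list quoted in the text; it does not run HMMER. The function name, the lower-case file extensions and the default executable name are assumptions made for this example.

```python
import shlex

def hmmbuild_command(name, hmmbuild="hmmbuild"):
    """Return the argument list for the hmmbuild call quoted in the practical.

    `hmmbuild` is the path to the executable (an assumption; the text uses
    \\cygwin\\bin\\HMMBUILD on Windows).
    """
    # Options as quoted in the text: aligned-FASTA input, protein alphabet,
    # then the output model file and the input alignment file.
    return [hmmbuild, "--informat", "afa", "--amino", f"{name}.hmm", f"{name}.afa"]

cmd = hmmbuild_command("NOMBRE")
print(" ".join(shlex.quote(part) for part in cmd))
```

If HMMER is installed, the same list could be handed to `subprocess.run(cmd, check=True)`; building the list first keeps file names with spaces safe.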

Genome On-Line Databases

NCBI - http://www.ncbi.nlm.nih.gov/genome/mapview/
GIGAdb - http://gigadb.org/
UCSC Genome Browser - http://genome.ucsc.edu/
diARK - http://www.diark.org/diark
GOLD - http://www.genomesonline.org/cgi-bin/GOLD/index.cgi

1. Deep-sequencing the DATA
Studies: Metagenomic 565; Non-Metagenomic 20,049
Biosamples, classified by ecosystem: Host-associated 11,980; Engineered 1,705; Environmental 7,177
Projects: Complete Projects 6,653; Permanent Drafts 23,565; Incomplete Projects 29,894; Targeted Projects 1,931
Organisms: 59,568 (Archaea 1,078; Bacteria 45,075; Eukarya 9,058)

2. Bioinformatic deciphering of the INFORMATION

Some Useful Genomic Websites for Sequence Data and Bioinformatic Tools

Major sequence databases and tools:
NCBI (USA) http://www.ncbi.nlm.nih.gov/
UCSC Goldenpath Browser (USA) http://genome.ucsc.edu/
Broad Institute (USA) http://www.broadinstitute.org/
DOE Joint Genome Institute (USA) http://genome.jgi-psf.org/
Washington U. Genome Inst. (USA) http://genome.wustl.edu/
J. Craig Venter Institute (USA) http://www.jcvi.org/
Sanger (Ensembl) (UK) http://www.sanger.ac.uk/
European Bioinformatics Institute (UK) http://www.ebi.ac.uk/
Climb, GigaDb (China) http://climb.genomics.cn/
KEGG (Japan) http://www.genome.jp/kegg/

Bioinformatic software:
Unipro UGENE 1.16 (HMMER 3.1b2) http://ugene.unipro.ru/
MEGA 6 (phylogenetic analyses) http://www.megasoftware.net/
FASTA (download package) http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml
CLUSTAL Omega (EBI server) http://www.ebi.ac.uk/Tools/msa/clustalw2/
LOGOMAT-M & SKYLIGN (servers) http://www.sanger.ac.uk/resources/software/
Phylogenetic analysis (server) http://www.phylogeny.fr/version2_cgi/index.cgi
RAxML (ML phylogeny server) http://phylobench.vital-it.ch/raxml-bb/
VISTA tools for comparative genomics http://genome.lbl.gov/vista/index.shtml
World Tour of Genomic Resources http://www.openhelix.com/cgi/tutorialInfo.cgi?id=119
Specialized Websites and Software for Bioinformatic Sequence Analysis
aLeaves (homologs, alignment, tree building) http://aleaves.cdb.riken.jp/aleaves/
CHIMERA (UCSF, 3D modeling of PDB files) https://www.cgl.ucsf.edu/chimera/
CLUSTALO (EBI server) http://www.ebi.ac.uk/Tools/msa/clustalo/
DiArk (Genome sequence download for all species) http://www.diark.org/diark
DOCK (UCSF) http://dock.compbio.ucsf.edu/DOCK_6/index.htm
FASTA (download programs) http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml
HMMER (webserver and download) http://hmmer.janelia.org/
MEGA (NJ & ML phylogeny) http://www.megasoftware.net/
PFAM (Protein Family Database: domains, alignments, pHMM, trees) http://pfam.xfam.org
PHYLIP (phylogeny programs) http://evolution.gs.washington.edu/phylip/
Phylogenetic analysis (server) http://www.phylogeny.fr/version2_cgi/index.cgi
PSI-PRED (Protein secondary structure, folding, modeling) http://bioinf.cs.ucl.ac.uk/web_servers/
RAxML (ML phylogeny server) http://phylobench.vital-it.ch/raxml-bb/
SFLD (Structure Function Linkage Database - Enzymes) http://sfld.rbvi.ucsf.edu/django/
Skylign logos (server) http://skylign.org/
SMART (Simple Modular Architecture Research Tool, pHMM) http://smart.embl-heidelberg.de/
Unipro UGENE (HMMER, etc) http://ugene.unipro.ru/
VISTA tools for comparative genomics http://genome.lbl.gov/vista/index.shtml
World Tour of Genomic Resources http://www.openhelix.com/cgi/tutorialInfo.cgi?id=119
NCBI variation resources:
http://www.ncbi.nlm.nih.gov/snp/
http://www.ncbi.nlm.nih.gov/variation/
http://www.ncbi.nlm.nih.gov/omim/

DNA

DNA is the main repository of hereditary info.
Every cell contains a copy of the genome encoded in DNA
Each chromosome is a single DNA molecule
A DNA molecule consists of an ordered sequence of nucleotides
The discrete nature of DNA composition allows us to treat it as a sequence of As, Cs, Gs, and Ts
DNA is replicated during cell division, largely transcribed into various RNAs, some of which encode protein amino acids
Only mutations on the germ line persist as evolutionary changes
Mutations may be conserved or variable, reflecting their evolutionary selection for functional importance in adaptation

The Central Dogma

DNA
transcription (via RNA polymerase)
messenger RNA
translation (via ribosome)
polypeptide
protein folding (via chaperones)
protein
cellular structure / function

RNA codon: amino acid
AGC: S
CGA: R
UUR: L
GCU: A
GUU: V
...: ...
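The transcription and translation steps of the Central Dogma slide can be sketched in a few lines. This is a minimal illustration that uses only the codon examples listed on the slide (with UUR expanded to UUA and UUG, since R denotes a purine); the function names and the input sequence are made up for the example, and a real analysis would use the complete genetic code.

```python
# Codon examples from the slide (RNA codon -> one-letter amino acid).
# "UUR" is expanded to UUA and UUG (R = purine, A or G).
CODONS = {"AGC": "S", "CGA": "R", "UUA": "L", "UUG": "L", "GCU": "A", "GUU": "V"}

def transcribe(coding_dna):
    """Return the mRNA matching a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna, table=CODONS):
    """Translate codon by codon; codons absent from the partial table give 'X'."""
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing incomplete codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(table.get(codon, "X") for codon in codons)

mrna = transcribe("AGCCGATTAGCTGTT")  # hypothetical coding-strand fragment
peptide = translate(mrna)
print(mrna, peptide)
```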
Eukaryotic Gene Syntax

complete mRNA
coding segment: ATG ... TGA

exon - intron - exon - intron - exon

ATG . . . GT ... AG . . . GT ... AG . . . TGA
start codon, donor site, acceptor site, donor site, acceptor site, stop codon

Regions of the gene outside of the CDS are called UTRs (untranslated regions), and are mostly ignored by gene finders, though they are important for regulatory functions.

Genomic Features of Interest
Alternative Splicing: Protein isoforms
More than 50% of genes produce more than one kind of transcript

Modrek et al. Nature Genetics 30, 13-19 (2002)
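The gene syntax above (ATG start, GT donor and AG acceptor sites at intron boundaries, stop codon) can be illustrated with a toy splicing function. This is a sketch under simple assumptions: intron coordinates are supplied by hand as half-open index pairs, and the example gene is invented; real gene finders infer these boundaries statistically.

```python
def splice(gene, introns):
    """Remove introns given as (start, end) half-open index pairs.

    Each intron must begin with the donor dinucleotide GT and end with the
    acceptor dinucleotide AG, as in the gene syntax diagram.
    """
    exons, position = [], 0
    for start, end in sorted(introns):
        intron = gene[start:end]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron {start}-{end} lacks GT...AG boundaries")
        exons.append(gene[position:start])
        position = end
    exons.append(gene[position:])
    return "".join(exons)

# Hypothetical two-exon gene: exon 1, one intron, exon 2.
gene = "ATGGCC" + "GTAAGTTTCAG" + "AAATGA"
cds = splice(gene, [(6, 17)])
print(cds)
```

The spliced coding segment starts with ATG and ends with the stop codon TGA, matching the layout of the slide.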

Genomic Features of Interest

Gen(om)e duplication - Most genes are members of gene families
Glusman et al. Genome Research 2001

Genomic Features of Interest

Recurrence of DOMAINS
Sequence definition: Distinct regions of protein sequence that are highly conserved in evolution, frequently associated with a defined functional role.
e.g. the SH2 domain is found embedded in a wide variety of metazoan proteins that regulate functionally diverse processes.
Pawson, T. et al., Trends in Cell Biology Vol.11 No.12 December 2001
Human Genomic Elements - unique and repeated sequences
Richards (2011) The Human Genome Guide, Figs. 12.11

TRANSCRIPTOME RNAs (4% coding versus 96% noncoding)
[Figure labels: lncRNA, lincRNA]

Small nuclear ribonucleic acid (snRNA): RNA splicing (removal of introns from hnRNA), regulation of
transcription factors or RNA polymerase II, and maintaining the telomeres. They are always associated with
specific proteins, and the complexes are referred to as small nuclear ribonucleoproteins (snRNP) often
pronounced "snurps". These elements are rich in uridine content.
Small nucleolar RNAs (snoRNAs). Small RNA molecules that play an essential role in RNA biogenesis and
guide chemical modifications of ribosomal RNAs (rRNAs) and other RNA genes (tRNA and snRNAs).
MicroRNAs (miRNAs) Post-transcriptional regulators that bind complementary sequences on target mRNA
transcripts (mRNAs), usually resulting in translational repression or target degradation and gene silencing.
Small interfering RNA (siRNA), short interfering RNA or silencing RNA, is a class of double-stranded RNA
molecules, 20-25 nucleotides in length, that play a role in the RNA interference (RNAi) pathway, where it
interferes with the expression of a specific gene.
Long noncoding RNAs (lncRNAs): non-protein-coding transcripts >200 nt, some of them intergenic (lincRNAs).

Alu SINE Repetitive Genomic Elements (transposons)
Originated in primates, 70 Mya
Active replication in humans

Gene/Genomic Features
Roles of Repetitive Genomic Elements

Primate Phylogeny and Comparative Genomics
- human-specific segmental duplications

Useful Webservers Genome Annotation
UCSC Genome Browser: https://genome.ucsc.edu/

Useful Webservers Genome Search & Annotation
http://www.ensembl.org/index.html

THE EPIGENETIC CODE
(reversible modifications of DNA and histones)
- Signals (sequence-specific), e.g. ncRNAs, transcription factors.
- Writers (recruitment of modifiers), e.g. Ac/Me transferases.
- Marks (writing), e.g. Me, Ac.
- Readers (interpret the signals, independently of RNA and TFs).
- Erasers (restore the signals), e.g. demethylases, deacetylases.
[Figure labels: Writers; Marks (Me); Specific signals; DNA]

THE EPIGENETIC CODE
(reversible modifications of DNA and histones)
DNA methylation at CpG islands in promoters is associated with silencing of a gene.
The MARKS are recognized by READERS and ERASERS: chromatin protein complexes responsible for structure and function, independent of the sequence-specific signals of the DNA.
[Figure labels: CpG density; me (methylated sites); Gene; Writers (e.g. Ac/Me transferases); Marks (Me); Readers; Erasers (e.g. demethylases, deacetylases)]

Covalent histone tail modifications at Histone H3 Lysine 9 (= H3K9):
Acetylated (H3K9Ac) = ACTIVE (loose euchromatin)
Methylated (H3K9me3) = INACTIVE (compact heterochromatin)

Adapted from Felsenfeld & Groudine, 2003
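The "CpG density" axis of the figure can be made concrete with a short calculation. The sketch below computes the GC fraction and the observed/expected CpG ratio, (number of CG dinucleotides x length) / (number of C x number of G), for a sequence window. The function name and the example sequences are assumptions for illustration; the slide itself does not prescribe a formula or thresholds.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    length = len(seq)
    c_count, g_count = seq.count("C"), seq.count("G")
    cpg_count = seq.count("CG")  # CG dinucleotides cannot overlap each other
    gc_fraction = (c_count + g_count) / length if length else 0.0
    # Expected CpG count under independence is C * G / length.
    if c_count and g_count:
        obs_exp = cpg_count * length / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_fraction, obs_exp

print(cpg_stats("CGCGCGCG"))  # hypothetical CpG-rich window
print(cpg_stats("ATATCCGG"))  # hypothetical CpG-poor window
```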


Summary of the Epigenetic Code
Sequence-specific signals bind to DNA.
Writers recognize these signals and add epigenetic marks.
Epigenetic memory can be retained despite the loss of the original signals.
Readers bind to epigenetic marks and recruit macromolecular complexes.
Erasers can remove these marks and the associated macromolecular complexes.

What is Bioinformatics?
It is a technical science based on concepts from molecular biology, to which informatics techniques (hardware and software) from mathematics, statistics and information science are applied in order to organize, interpret and understand the significance of those molecules as a whole.

How is Bioinformatics used in Practice? - data, algorithms, research
Store/retrieve biological information (databases)
Retrieve/compare gene and protein sequences
Predict the function of unknown genes/proteins
Look for a known function among the homologs of a gene
Integrate the analyses with data from other sources
Compile/distribute data in publications and presentations

A bioinformatics program may

Identify a gene, RNA or protein, by appropriate matching criteria to homologous sequences.
Determine the functional sites (e.g. promoters, exons, introns, termination sequence, type of RNA, protein ORF functional domains, interactions, etc.)
Evaluate anomalous, deleterious or inconsequential changes by comparative analyses and evolutionary modeling of large datasets.
Build models based on primary (sequence), secondary, tertiary and quaternary structure, to visualize and interpret function (e.g. alignments, 2D folding, protein 3D or 4D chromosome conformation capture).
Analyze and statistically evaluate population data to assess disease association.

[Diagram: Bioinformatics & Computational Biology, drawing on molecular biology, genomics, genetics, mathematics, biochemistry, statistics, numerical analysis, biophysics, algorithmics, evolution, image analysis and data management]

Bioinformatic Methodology for Gene Family Analysis

1. HOMOLOG SEARCH - raw data: Identify other sequences with a common evolutionary history, using sequence comparison identity/similarity statistics from FASTA, BLAST and HMMER.
2. MULTIPLE SEQUENCE ALIGNMENT (MSA) - organized data: Align common sites, introducing gaps where necessary.
3. PHYLOGENETIC ANALYSES - evolutionary tree reconstruction: Chart the order, timing and rates of gene duplication and speciation throughout the gene family evolutionary history. Distinguish paralogs (duplicated gene in one species) and orthologs (same gene in other species) using distance, likelihood and Bayesian algorithms.
4. HIDDEN MARKOV MODEL (HMM) - structure-function profiling: Compute site-specific probability statistics on a selected sub-alignment and visualize with a sequence logo. Use as a tool for improved homolog search and realignment.
5. KEY FUNCTIONAL RESIDUES: Specificity determining positions (SDP) identify conserved changes between subfamilies, polymorphisms (SNP), rare mutations, and possible phenotypic variants.
6. STRUCTURAL MODELING - informative 2D and 3D models: Visualize spatial context to infer functional role(s), identify known, discontiguous domains, enzyme active site, interaction ligands, conformational stability.
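Step 4 of the methodology (site-specific statistics visualized as a sequence logo) can be approximated with a per-column information content. The sketch below uses the simple Shannon form, log2(20) minus the column entropy, ignoring gaps; this is an assumption chosen for brevity and is not the exact measure used by HMMER or Skylign. The toy alignment is invented.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r not in "-."]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum((count / total) * math.log2(count / total)
                   for count in Counter(residues).values())
    # A fully conserved column reaches the maximum, log2(alphabet_size).
    return math.log2(alphabet_size) - entropy

alignment = ["MKYL", "MRYL", "MKF-", "MKYI"]  # hypothetical aligned fragments
info = [column_information(column) for column in zip(*alignment)]
print([round(bits, 2) for bits in info])
```

Columns with higher values are drawn taller in a logo and are the first candidates for functional sites.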
Bioinformatics Practicals Applied to Gene Families
Data mining, comparative bioinformatic analysis, modeling, inference and presentation

Objectives of the Bioinformatics Laboratory Practicals
Structure, Function, Phenotype

Weeks: 6-8 May, 8-9 May, 12-16 May
Information (Gene-Protein): PubMed, Mim, GeneCards, Ensembl
Search (Homologs): FASTA, (PSI)BLAST, (JACK)HMMER
Order (Alignment): Cobalt, Clustal, Muscle, Mafft
Characterize (Structure): PDB, SMART, CDD, PsiPred
Model (HMM): HMMER 3.1f
Infer (Function): HMM+LOGO, SKYLIGN

Weeks: 19-23 May, 26-30 May, 2-6 June
Reconstruct (Phylogeny): programs, algorithms and parameters: NJ, MP, ML, Mega, Bayes, RAxML, PFAM, SFAM
Classify (Subfamilies): orthologs and paralogs; evaluation of trees
Models (HMM profile): HMMER3, LOGOMAT; pHMM applications
Structure-Function (3D modeling): PDB, SNPs, ENZYMES, CONSURF
Integration (data, info): deposit results and report; organize the presentation
Presentation (data, report): written and oral.

Genomic Analysis based on Molecular Evolutionary Modeling

1D: sequence homology - pairwise sequence comparison against databases (gDNA, cDNA, aa): BLAST, FASTA, HMMER
2D: alignment: Ugene, M-Coffee
Phylogenetic classification: MEGA (NJ, ML), PhyloBayes, PHYLIP, PAUP
Molecular profiling (pHMM): HMMER, Haploview, Phylofoot, Evoprint
Functional information (tracing subfamily conservation, divergence, rates and patterns): SDP-Pred, DIVERGE, K-estimator
3D: structural models - computational biology in 3D context: Docking, Consurf, Dal, CONSURF, CHIMERA

Phylogenomics

Comparison of genes and gene products across a number of species (whole genomes), characterizing homologues and gaining insights into the evolutionary process itself.
Pharmacophylogenomics is the use of phylogenomics in aid of drug discovery, through improved target selection, validation and risk assessment.

Homologue Relationships
Orthologues: any gene pairwise relation where the ancestor node is a speciation event.
Paralogues: any gene pairwise relation where the ancestor node is a duplication event.
Introduction to Sequence Analysis

Substitution matrices
The empirical frequency with which amino acid type i is replaced by type j (or vice versa) is written as $M_{i,j}$ in the matrix: the score for aligning two Ys in an alignment YY/YY is 10+10=20, a very significant score, whereas that of YY/TP is 0-5=-5.

1. HOMOLOG SEARCH
In NCBI-GENE, look up the reference species and the longest or most common isoform, and download the FASTA.
a) Use the EBI-FASTA program with UniProt to obtain pairwise sequence scores.
b) NCBI-BLASTP or PSI-BLAST with the NR db searches for conserved domains.
c) PHMMER and JACKHMMER use a protein, Aln or HMM to search for proteins with a complete profile, and can find more distant homologs in UniProt.
Always request up to 1000+ results (without changing other parameters) and tick the boxes of 150-200 dispersed homologs (orthologs, paralogs, outgroup).
Save the well-distributed selection of 150-200 in complete FASTA format, checking coverage, identity/similarity, high score, E-value and the alignments at the bottom.
Short names can be introduced in the FASTA to avoid conflicts with other programs. Also save the results page (the graphic), statistics and alignments as text with "reformat results", to review if needed.
All the data can also be saved in complete FASTA and GenBank formats to keep all the annotation information.
File in FASTA format:
>NAME Description (line break)
Sequence (aa or nt)
>TYR-Hsa gi|4507753|ref|NP_000363.1| tyrosinase precursor [Homo sapiens]
ACDEFGHI ...
Include orthologs and paralogs as homologs, and some distant species that serve as outgroup for the root of the phylogenetic trees.
Consult EBI-PFAM, Superfam and Sanger-Ensembl to see the complete family.
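The FASTA layout described in the homolog search section (a header line starting with ">", holding a short name and a description, followed by sequence lines) is easy to read programmatically. The parser below is a minimal sketch; the function name is an assumption, the first header is the one shown in the text, and the extra sequence line and second record are invented for the example.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, description, sequence) tuples."""
    records = []
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)  # short name, then free-text description
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records

example = """>TYR-Hsa gi|4507753|ref|NP_000363.1| tyrosinase precursor [Homo sapiens]
ACDEFGHI
KLMN
>OUT-1 hypothetical outgroup sequence
ACDE
"""
records = parse_fasta(example)
for name, description, sequence in records:
    print(name, len(sequence))
```

Keeping the short name as the first word of the header is what lets downstream programs (alignment, phylogeny) label sequences without conflicts.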

Homolog Search General Considerations

GOALS: Organize putative homologs into a multiple sequence alignment; analyze its site-specific variation to produce a comprehensive sequence profile (logo), determine the evolutionary duplication history to confirm homolog relationships, and incorporate useful information into 3D models.
SUMMARY: Collect putative homologs into a FASTA file, align them, generate a pHMM, reconstruct phylogenetic trees, build 2D and 3D models.
1. HOMOLOGS: Select a full-length, representative protein for distant searches (or nucleotide for very similar searches, e.g. single nt variants).
Choose the appropriate BLAST program (e.g. BLASTP or PSI-BLAST) and database (e.g. nr) to target the homolog search for high-scoring pairs (HSPs) with low E-value and high coverage.
Adjust parameters by specifying taxonomy/organism, raising the number of output descriptions (e.g. 1000), restricting by lowering E, and optionally changing the substitution matrix.
Select >100 true homologs of interest, representing the full range of species orthologs, and related paralogous genes. Download to a FASTA file and a separate GenBank file containing annotations, then adjust and reformat the page as text to output and save a text file of all results (names, scores, alignments).
Generate and save a multiple alignment directly from results using COBALT.

2. ALIGNMENT of HOMOLOGS

Use at least 3 different programs (EBI-Clustalo, Ugene-Muscle and Web-T/M-Coffee (with up to 10 programs)) to align the sequences of the homolog file in FASTA format. You can find all of them on the Max Planck Bioinformatic Tools website.
Compare the results in Ugene and find errors in each alignment! M-Coffee and Prob-Cons identify sequences and regions that are difficult to align with any program.
In Ugene, delete the columns with a majority of gaps, and the rows that represent fragments (fewer than 75% of the aa of the complete reference protein). Replace them with more complete sequences. Remove duplicates, and set short, informative names.
The alignment should be saved (or exported from Ugene) in Clustal (.aln), aligned FASTA (.afa) and MEGA (.meg) formats, preferably with a shorter name. Include the corresponding extension so the file can be passed to other programs such as HMMER, phylogeny programs (MEGA, Phylobayes, RAxML), PSI-Pred (secondary structure), etc.
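The alignment clean-up rule from the alignment section (drop fragment rows shorter than 75% of the reference protein, then drop columns where gaps are the majority) can be sketched as follows. The practical performs this by hand in Ugene; the function name, the order of the two steps and the toy alignment are assumptions for illustration.

```python
def clean_alignment(records, reference, min_fraction=0.75, gap="-"):
    """Filter an alignment given as a list of (name, aligned_sequence).

    Rows with fewer residues than min_fraction of the reference are removed
    first; columns in which gaps are the majority are removed afterwards.
    """
    sequences = dict(records)
    reference_length = len(sequences[reference].replace(gap, ""))
    kept = [(name, seq) for name, seq in records
            if len(seq.replace(gap, "")) >= min_fraction * reference_length]
    n_rows = len(kept)
    n_cols = len(kept[0][1])
    keep_columns = [i for i in range(n_cols)
                    if 2 * sum(seq[i] == gap for _, seq in kept) <= n_rows]
    return [(name, "".join(seq[i] for i in keep_columns)) for name, seq in kept]

toy = [
    ("ref", "MKT-AYL"),
    ("hom1", "MKTGAYL"),
    ("hom2", "MRT-AYL"),
    ("frag", "----AY-"),  # fragment: 2 of 6 reference residues
]
cleaned = clean_alignment(toy, reference="ref")
for name, seq in cleaned:
    print(name, seq)
```

Filtering rows before columns matters: gaps introduced only by discarded fragments no longer count when deciding which columns to keep.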
Hands-on Practice: Homolog Search (FASTA)
1. Global (full) sequence match: search a protein (in FASTA format) against a comprehensive protein database such as UniProt+TrEMBL at EBI:
http://www.ebi.ac.uk/Tools/sss/fasta/
Select the UniProtKB target database to compare with your protein sequence (usually in FASTA format) using the FASTA (or more rigorous SSEARCH) program, adjusting parameters as follows.
- Blosum50 (substitution matrix) and high-scoring pairs (HSP)=no are OK;
- Expect=1 if you anticipate many matches, Scores=1000, Align=500-1000. Submit.
Results: Job ID: fasta-I20141127-124108-0089-4162265-oy
Notes: Select the full protein database (SP+TrEMBL) and the rapid FASTA algorithm, with a lower E-value (more stringent) and a higher limit for the estimated number of putative homologs.

BLAST (PSI-)BLASTP and TBLASTN search

http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome with ANXA10.PRO (Fasta format) against nr protein or TSA nt databases, looking for 1000+ matching sequences with default E=10 and full-length coverage.

Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution.
[Figure: number of alignments against score, marking your score and the expected number of random hits]

Expect Value
E = number of database hits you expect to find by chance

$E = K m n\, e^{-\lambda S}$ or $E = m n\, 2^{-S'}$

$S$ = your score (raw score)
$m$, $n$ = query length and size of database
$K$ = scale for search space
$\lambda$ = scale for scoring system
$S'$ = bitscore = $(\lambda S - \ln K)/\ln 2$
(applies to ungapped alignments)
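The two Expect-value formulas in the Local Alignment Statistics slide are equivalent once the raw score is converted to a bit score, which a short numerical check makes visible. In the sketch below the function names are made up, and the values of lambda, K, the query length and the database size are illustrative assumptions, not values taken from the slides.

```python
import math

def bit_score(raw_score, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)

def e_value_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)

# Illustrative parameters (assumptions for this example only).
lam, k = 0.318, 0.134
m, n = 250, 100_000_000  # query length and database size in residues
raw = 120

bits = bit_score(raw, lam, k)
print(round(bits, 1), e_value_from_raw(raw, m, n, lam, k), e_value_from_bits(bits, m, n))
```

Because the bit score absorbs lambda and K, E-values computed from bit scores are comparable across scoring systems.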
BLAST Output: Domains, Graphics, Selectable Descriptions with Scores, Pairwise Alignments

PSI-BLASTP search at http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome with ANXA10.PRO (Fasta format)
1. Graphics Summary
2. Descriptions: output can be reformatted prior to export
3. Alignments: use pairwise alignments to evaluate/select results

BLASTP: simple pairwise comparison of 2 protein sequences in FASTA format
TBLASTN: pairwise comparison of MAPT.PRO vs human chromosomes
NCBI - Databases
NCBI-Gene: Start here, follow the links, sequences, maps, etc.

MAPT
gene linkage
exon organization
alt. spliced transcripts
microtubule-binding domains
http://www.ncbi.nlm.nih.gov/snp/
http://www.ncbi.nlm.nih.gov/variation/
http://www.ncbi.nlm.nih.gov/omim/

NCBI BLAST: versions; look at the graphics, the descriptors and their parameters, and the alignments.
Save the results; download all the sequences (FASTA and GenBank) and the selected ones.
BLASTP: simple pairwise comparison of 2 protein sequences in FASTA format
TBLASTN: pairwise comparison of MAPT.PRO vs human chromosomes
1. Graphic summary
2. Names, codes, descriptions
3. Alignments
4. Selection of putative homologs: broad, diverse, orthologs and paralogs, outgroup

Profile HMMs
Profile hidden Markov models
Subfamily HMM construction

1. At completely conserved positions, and subfamily gapped positions: Use match state distributions estimated for general (family) HMM.
2. At other positions:
   1. Estimate Dirichlet mixture density posterior for each subfamily at each position separately.
   2. Use Dirichlet density posteriors to weight contributions from other subfamilies.
   3. Compute amino acid distribution using weighted counts and standard Dirichlet procedure.
Brown et al., Subfamily HMMs in functional genomics (2005) Pacific Symposium on Biocomputing

Andrey Andreyevich Markov 1856-1922

MARKOV MODELS
Autonomous system, state fully observable: Markov chain
Autonomous system, state partially observable: Hidden Markov model (profile HMM)
Controlled system, state fully observable: Markov decision process
Controlled system, state partially observable: Partially observable Markov decision process

A Markov chain models the state of a system with a random variable that changes through time, dependent on the previous state.
A hidden Markov model is a Markov chain for which the state is only partially observable (e.g. MSA position lacking some aa).
A Markov decision process is a Markov chain in which state transitions depend on the current state and an action vector that is applied to the system.
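The Markov chain definition above (each state depends only on the previous state) translates directly into a product of transition probabilities. The sketch below computes the probability of a short state path; the state labels M and I are borrowed loosely from the match and insert states of profile HMMs, and every probability is an invented value for illustration.

```python
import math

def path_log_probability(path, start, transitions):
    """Log-probability of a state path under a first-order Markov chain."""
    log_p = math.log(start[path[0]])
    for previous, current in zip(path, path[1:]):
        # Markov property: the next state depends only on the previous one.
        log_p += math.log(transitions[previous][current])
    return log_p

# Hypothetical two-state chain; each row of transition probabilities sums to 1.
start = {"M": 0.9, "I": 0.1}
transitions = {
    "M": {"M": 0.8, "I": 0.2},
    "I": {"M": 0.6, "I": 0.4},
}

probability = math.exp(path_log_probability("MMIM", start, transitions))
print(round(probability, 4))
```

In a hidden Markov model the path itself is not observed; only the emitted residues are, so the probabilities of all compatible paths have to be combined.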
Subfamily HMMs increase the separation between true and false positives
515 unique SCOP folds
PFAM full MSAs
Scored against Astral PDB90
1.5% error rate in subfamily classification using top-scoring SHMM

Some important concepts
Sequence comparison methods for homology searching have emerged, such as patterns, profiles (an aligned set of sequences containing a domain) and HMMs (statistical models of the primary structure of the sequences).
Motif: if we look at a multiple alignment of homologous proteins, we will see that some columns vary considerably while others are more conserved. When we observe certain neighboring columns with high conservation, that is, when we find stretches of the sequences that are more conserved than others and that could functionally characterize the proteins, we usually speak of MOTIFS.

Some important concepts

There are different methods for describing and locating motifs:
1. Regular expressions or patterns: from the information contained in a multiple alignment, a pattern or regular expression is derived and used to characterize motifs, indicating which positions are most important, which can vary, and which variations they tolerate.
2. Profiles: a profile is a position-specific substitution matrix. It is built from the multiple alignment, taking into account the frequency of the amino acids at each position as well as their physicochemical properties.

One difference between profiles and regular expressions or patterns is that a profile is not limited to small regions with a high degree of similarity; it is more useful for defining longer regions or domains that characterize protein families rather than motifs. A profile can cover both conserved and variable regions of the alignment.
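
To make the idea of a pattern concrete, the following minimal sketch converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function name `prosite_to_regex` and the test sequences are hypothetical; only a simplified subset of the PROSITE syntax is handled (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors at the pattern ends). The example pattern `N-{P}-[ST]-{P}` is the classic N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simplified PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # "<" anchors to the N-terminus
        at_end = element.endswith(">")       # ">" anchors to the C-terminus
        element = element.strip("<>")
        # Split "x(2,4)" into the residue part and optional repeat counts.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                      # any residue
            atom = "."
        elif core.startswith("{"):           # excluded residues
            atom = "[^" + core[1:-1] + "]"
        else:                                # single residue or [allowed set]
            atom = core
        if low:
            atom += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + atom + ("$" if at_end else ""))
    return "".join(parts)


# N-glycosylation site: Asn, anything but Pro, Ser or Thr, anything but Pro.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(regex, "MKANGSAKL")          # hypothetical sequence
```

Here `regex` is `N[^P][ST][^P]` and `hit.group()` is `NGSA`. A pattern of this kind says nothing about how likely each residue is at each position, which is the gap that profiles and profile HMMs fill.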
Some important concepts
3. Profile HMMs: these offer a more sensitive way, encompassing regular-expression patterns and conventional profiles, of searching for remote homologs and conserved domains, based on a statistical description of the consensus primary structure of a protein family.
In the HMM that we are going to analyze, each position has three possible states, corresponding to the probability of finding a given residue at that position (match), the probability of an insertion, and the probability of a deletion.

Introduction
Hidden Markov models (HMMs) emerged as a tool for speech processing: a statistical model that, through a learning algorithm, extracted the main stochastic characteristics of a speech signal.
With the huge amount of data coming from the sequencing of different genomes, a related problem arises: how to extract the underlying information from these data.
Solution: HMMs.

Hidden Markov Models

A hidden Markov model (HMM) is a finite set of states.
The transitions between states are given by a set of transition probabilities.
In any particular state, an observation can be generated according to that state's emission probability distribution.
Only the observable outcome, not the state, is visible to an external observer, so the states are hidden.

Alphabet: $\Sigma = \{b_1, b_2, \ldots, b_M\}$
Set of states: $Q = \{1, \ldots, K\}$
Transition probabilities between any two states:
$a_{ij}$ = probability of a transition from state $i$ to state $j$
$a_{i1} + \cdots + a_{iK} = 1$, for all states $i = 1 \ldots K$
Initial probabilities $a_{0i}$:
$a_{01} + \cdots + a_{0K} = 1$
Emission probabilities within each state:
$e_k(b) = P(x_i = b \mid \pi_i = k)$
$e_k(b_1) + \cdots + e_k(b_M) = 1$, for all states $k = 1 \ldots K$
At each time step $t$, the only thing that affects future states is the current state $\pi_t$:
$P(\pi_{t+1} = k \mid \text{whatever happened so far}) = P(\pi_{t+1} = k \mid \pi_1, \pi_2, \ldots, \pi_t, x_1, x_2, \ldots, x_t) = P(\pi_{t+1} = k \mid \pi_t)$
The 3 main questions about HMMs

Evaluation
Given an HMM $M$ and a sequence $x$, find $\mathrm{Prob}[x \mid M]$.
Decoding
Given an HMM $M$ and a sequence $x$, find the sequence of states $\pi$ that maximizes $P[x, \pi \mid M]$.
Learning
Given an HMM $M$ with unknown transition/emission probabilities and a sequence $x$, find the parameters $\theta = (e_i(\cdot), a_{ij})$ that maximize $P[x \mid \theta]$.

Decoding
Given a sequence of observations $x$, find the sequence of states $\pi$.
DNA coding (C) vs non-coding (N):
x = AACCTTCCGCGCAATATAGGTAACCCCGG
π = NNCCCCCCCCCCCCCCCCCNNNNNNNN
We want to find $\pi = \pi_1, \ldots, \pi_N$ such that $P[x, \pi]$ is maximized:
$\pi^* = \arg\max_{\pi} P[x, \pi]$
We can use dynamic programming. Let
$V_k(i) = \max_{\{\pi_1, \ldots, \pi_{i-1}\}} P[x_1 \ldots x_{i-1}, \pi_1, \ldots, \pi_{i-1}, x_i, \pi_i = k]$
= probability of the most likely sequence of states ending in state $\pi_i = k$.
[Figure: trellis of states 1, 2, ..., K at each position of the sequence x_1, x_2, x_3, ...]

Viterbi algorithm
It is similar to aligning a set of states to a sequence.
Time complexity: $O(K^2 N)$
Space complexity: $O(KN)$
(K = number of states, N = sequence length)

Viterbi and Forward algorithms

VITERBI
Initialization: $V_0(0) = 1$; $V_k(0) = 0$ for all $k > 0$
Iteration: $V_j(i) = e_j(x_i) \max_k V_k(i-1)\, a_{kj}$
Termination: $P(x, \pi^*) = \max_k V_k(N)$

FORWARD
Initialization: $f_0(0) = 1$; $f_k(0) = 0$ for all $k > 0$
Iteration: $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
Termination: $P(x) = \sum_k f_k(N)\, a_{k0}$
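
The two recursions can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it works with plain probabilities rather than logarithms, assumes a non-empty sequence, and has no explicit end state (equivalent to taking $a_{k0} = 1$ in the forward termination). The two-state coding/non-coding model and all its numbers (`STATES`, `START`, `TRANS`, `EMIT`) are hypothetical values chosen only for the example.

```python
STATES = "NC"  # N = non-coding, C = coding (hypothetical toy model)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(x, states, start, trans, emit):
    """Return (most probable state path, its joint probability with x)."""
    # V[i][k]: probability of the best path that emits x[:i+1] and ends in k.
    V = [{k: start[k] * emit[k][x[0]] for k in states}]
    back = []  # back[i][j]: best predecessor of state j at position i+1
    for symbol in x[1:]:
        column, pointers = {}, {}
        for j in states:
            best_k = max(states, key=lambda k: V[-1][k] * trans[k][j])
            column[j] = emit[j][symbol] * V[-1][best_k] * trans[best_k][j]
            pointers[j] = best_k
        V.append(column)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), V[-1][last]


def forward(x, states, start, trans, emit):
    """Return P(x), summed over all state paths."""
    f = {k: start[k] * emit[k][x[0]] for k in states}
    for symbol in x[1:]:
        f = {l: emit[l][symbol] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f.values())


best_path, best_prob = viterbi("AATACGCGCGTATA", STATES, START, TRANS, EMIT)
total_prob = forward("AATACGCGCGTATA", STATES, START, TRANS, EMIT)
```

The only difference between the two functions is `max` versus `sum` in the iteration step, which is why both run in $O(K^2 N)$ time. For real sequences the products underflow quickly, so log probabilities (Viterbi) or scaling (forward) are normally used.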
Training algorithms
We have a set of example sequences of the kind we want the model to fit (training sequences), which we assume to be independent.
If we knew the path of states that the model followed, the states would not be hidden (the HMM becomes a Markov chain), and the maximum likelihood estimators of the emission and transition frequencies are obtained from the observed frequencies.
If we have (biological or physical) information that provides prior knowledge about the probability distribution, we can add it to the model as pseudocounts.

Training algorithms
Goal: given a sequence of observations, find the most probable model that generates that sequence.
Problem: we do not know the relative frequencies of the hidden states visited.
No analytical solutions are known.
We approach the solution by successive approximations.
The problem is now one of optimization, so many heuristics can be used (simulated annealing, genetic algorithms, etc.).
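
The known-path case can be sketched as simple counting. In this hedged example the function name `estimate_hmm`, the labelled training pair and the default pseudocount of 1 are all hypothetical; the point is only that maximum likelihood estimates are normalized counts, and that pseudocounts are added to the counts before normalizing.

```python
from collections import Counter


def estimate_hmm(pairs, states, alphabet, pseudocount=1.0):
    """Estimate transition and emission probabilities from labelled data.

    pairs: iterable of (sequence, state_path) strings of equal length.
    With pseudocount=0 this is the plain maximum likelihood estimate,
    which fails (division by zero) for a state that is never observed.
    """
    trans_counts, emit_counts = Counter(), Counter()
    for x, path in pairs:
        for symbol, state in zip(x, path):
            emit_counts[(state, symbol)] += 1
        for prev, nxt in zip(path, path[1:]):
            trans_counts[(prev, nxt)] += 1

    trans, emit = {}, {}
    for k in states:
        t_total = sum(trans_counts[(k, j)] + pseudocount for j in states)
        trans[k] = {j: (trans_counts[(k, j)] + pseudocount) / t_total
                    for j in states}
        e_total = sum(emit_counts[(k, b)] + pseudocount for b in alphabet)
        emit[k] = {b: (emit_counts[(k, b)] + pseudocount) / e_total
                   for b in alphabet}
    return trans, emit


# Hypothetical training pair: sequence and its known state path.
training = [("AACCG", "NNCCC")]
trans, emit = estimate_hmm(training, "NC", "ACGT")
```

With the pseudocount, events never seen in the training data (for example a C-to-N transition here) receive a small non-zero probability instead of zero.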

Baum-Welch algorithm
This is the Expectation-Maximization (EM) algorithm for parameter estimation.
- Applicable to any stochastic process.
- Finds the expected frequencies of the possible values of the hidden variables.
- Computes the maximum likelihood distributions of the hidden variables from the forward and backward probabilities.
- Repeats these steps until some convergence criterion is satisfied.
Time complexity: n iterations × $O(N^2 T)$ (here N denotes the number of states and T the sequence length)

Applications of HMMs
Probabilistic models are becoming increasingly important in biological analysis, particularly in analysis problems with many parameters.
Since many problems in computational biology reduce to the analysis of short linear sequences, HMM-based models have been applied to many problems:
gene finding, radiation hybrid mapping, genetic linkage mapping, phylogenetic analysis and protein secondary structure prediction.
The most successful applications are profile HMMs and HMM-based gene finders.
Profile HMMs
From an HMM trained on a set of previously aligned sequences (ClustalW), one can obtain the stochastic characteristics (profile) of a family of DNA or protein sequences.
Proteins contain regions of considerable length with neither gaps nor residue insertions.
For these, a model can be built with match states only, with transition probability 1 between each state and the next, and with residue emission probabilities calculated from their frequency of occurrence.

Profile HMMs
In M1 the amino acid symbols (A1..Al) are emitted with the emission probabilities that result from their frequency of occurrence in column 1 of the sequences supplied as data.
The transition probabilities between each state and the next are forced to 1.
Amino acid sequences contain stretches where a consensus can be found (match states) and others where insertions (insert states) or gaps (delete states) appear.
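
The match-only model described above can be sketched directly: one match state per alignment column, emission probabilities equal to the column frequencies, and transitions fixed at 1. The ungapped block of four sequences with three positions below is hypothetical, as are the function names; no pseudocounts are applied, so any residue absent from a column gets probability 0.

```python
from collections import Counter


def match_emissions(alignment):
    """Return one emission distribution per column of an ungapped alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({aa: n / len(column) for aa, n in counts.items()})
    return profile


def sequence_probability(profile, seq):
    """P(seq | model) when every match-to-match transition has probability 1."""
    if len(seq) != len(profile):
        raise ValueError("a match-only model emits exactly one residue per state")
    p = 1.0
    for emissions, aa in zip(profile, seq):
        p *= emissions.get(aa, 0.0)
    return p


block = ["ACD", "ACE", "GCD", "ACD"]      # hypothetical ungapped block
profile = match_emissions(block)          # M1, M2, M3
p_consensus = sequence_probability(profile, "ACD")
```

In this example M1 emits A with probability 0.75 and G with 0.25, M2 always emits C, and `p_consensus` is 0.75 × 1 × 0.75 = 0.5625. Insert and delete states are what a full profile HMM adds on top of this chain.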

Profile HMMs
Figure 2 shows an HMM for an alignment of four sequences with three positions.
An HMM is composed of a series of nodes or states, each of which emits symbols (one of 4 nucleotides or 20 amino acids) with a given probability.
The states are connected sequentially, with transition probabilities between them. There are also insertion and deletion probabilities.

SOFTWARE FOR PROFILE HMMs
Multiple software packages are available for implementing profile HMMs.
The main difference between them is the architecture they adopt:
- BLOCKS and META-MEME represent motif models, the classic HMMs.
- HMMER2 Plan7 and profile HMM represent the new generation of profile HMMs in SAM, HMMER and PFTOOLS.

The author distinguishes two kinds of model:
- Profile models: models with insert and delete states associated with each match state, allowing insertions and deletions anywhere in the target sequence.
- Motif models: models dominated by strings of match states (modelling ungapped blocks of sequence consensus), separated by a small number of insert states that model the spacing between the ungapped blocks.

SAM, HMMER, PFTOOLS and HMMpro implement models based at least in part on the original profile HMMs of Krogh (1994).
These packages augment the simple model to deal with multiple domains, aligned sequences and local alignments.
Whether an alignment is local or global is not necessarily intrinsic to the algorithm; it can instead be a probabilistic part of the model architecture.

SAM and HMMER
Use Dirichlet mixtures in many distributions to help with the number of free parameters. If hybrid HMM/neural network techniques are adopted, this becomes more pronounced.
HMMER and PFTOOLS
Are used primarily to build searchable databases of models in which the alignments are available.
PROBE, META-MEME and BLOCKS
Assume different motif models; the alignments consist of one or more ungapped blocks, separated by intervening sequences that are assumed to be random. PROBE and META-MEME adopt probabilistic models for the gaps.
GENEWISE
A sophisticated frame-search application that can take a HMMER protein model.
PSI-BLAST
Not an HMM application, but it uses the principles of probabilistic models to build HMM-like models from multiple alignments.

LIBRARIES OF PROFILE HMMs
Profile HMM software is well suited to:
- Modelling a particular sequence family of interest.
- Searching a database for homologous sequences.
We now also need to search a single sequence against a library of profile HMMs.
Building a library requires a large number of multiple alignments of common domains.

Two large collections of profile HMMs are available:
- Pfam
- PROSITE
Both databases are available on the web.
A database made up of profile HMMs obtained for different domains or conserved regions in proteins. The HMM method is also used ...
PROSITE contains profiles for 290 protein domains, and Pfam contains 1313.
The number of protein families is much debated: the figure of 1000 has been cited on occasion, and others argue that the total is of about that same size.
Neither of these profile servers is mature; both profile databases and the profile software are changing rapidly.

Pfam
A database made up of the profile HMMs obtained for different domains or conserved regions of proteins. It contains multiple alignments of proteins and the profile HMMs of those protein families. It is a semi-automatic database whose aim is to be complete and accurate.
PROSITE
A database containing detailed information on all known protein sequence motifs. The motifs are described by regular-expression patterns.
BLOCKS
A database made up of short segments of multiple alignments corresponding to entries in PROSITE. In fact, BLOCKS is a motif detection system rather than a database as such.
PRODOM
A database of protein domains generated automatically from SWISS-PROT and TrEMBL; it consists of an automatic compilation of homologous domains. It is built using an improved procedure based on PSI-BLAST.
PRINTS
The PRINTS database is similar in concept, but it is used to retrieve blocks called "fingerprints". A fingerprint is a group of conserved motifs used to characterize a protein family. PRINTS has recently been made available as an on-line BLAST service with search software, providing greater efficiency and improved statistics for estimating the reliability of the retrieved matches.

CONCLUSION
The human genome project threatens to overwhelm us in a deluge of sequence data.
Annotating long sequences is very difficult for many people.
The development of robust methods to automate sequence classification and annotation is imperative.
The hope is that profile HMM methods can provide a second tier of solid, sensible, statistically based analysis tools that complement BLAST and FASTA analyses.

Create a pHMM, then use it to locate and rank domains in proteins.fa, or to align proteins.fa

1. Create a pHMM from an alignment:
hmmbuild --informat afa PROTNAME.HMM PROTNAME.AFA
hmmbuild --informat clustal PROTNAME.HMM PROTNAME.ALN

2. Locate an HMM domain in one or more proteins:
hmmsearch -E 1000 -o PROTOUT.TXT PROTNAME.HMM PROT.FA.LIB

3. Align the sequences in a file MULTI.fa based on a pHMM:
hmmalign -o PROTNAME.aln --trim --amino --informat FASTA --outformat CLUSTAL PROTNAME.hmm PROTNAME.fa

4. Export the consensus sequence (majority aa) of the HMM:
hmmemit -c -o PROTOUT.TXT PROTNAME.HMM
HMMER server, Janelia Farm (HHMI), http://hmmer.janelia.org/ (now at EBI)
Permits rigorous searches using a protein sequence or HMM vs curated databases.
Results by taxonomy (species domains), in various formats, plus sequence logos.
HMMSEARCH vs UniProt: > 3000 results and taxonomy

No Hummer!

Programs available on server:
phmmer - used to search one or more query protein sequences against a protein sequence database.
hmmscan - searches protein sequences against collections of profiles, e.g. Pfam. In HMMER2 this was called hmmpfam.
hmmsearch - used to search one or more profiles against a protein sequence database.
jackhmmer - iteratively searches a query protein sequence, multiple sequence alignment or profile HMM against the target protein sequence database.

Other programs in suite:
hmmalign - performs a multiple sequence alignment of all the sequences in the input (usually identified by running an hmmsearch), by aligning them individually to the profile HMM.
hmmbuild - builds a profile HMM for each multiple sequence alignment in the input multiple sequence alignment file, and saves it to a new file.
hmmconvert - utility that converts an input profile file to different HMMER formats.
hmmfetch - retrieves one or more profile HMMs from a profile database (e.g. Pfam).
hmmpress - takes a profile database in standard HMMER3 format and constructs binary compressed data files for hmmscan.
hmmstat - utility that prints out a tabular file of summary statistics for each profile.

Central role of multiple alignments
[Figure: multiple alignment at the centre, linked to: phylogenetic studies; comparative genomics; hierarchical function annotation (homologs, domains, motifs); gene identification and validation; structure comparison and modelling; RNA sequence, structure and function; interaction networks; human genetics and SNPs; therapeutics, drug design and drug discovery (DBD, LBD, insertion domain, binding sites / mutations)]
A profile hidden Markov model (pHMM) reflects the statistical probability of functional relevance

Major Websites for Bioinformatic Data and Analysis
NCBI (USA) http://www.ncbi.nlm.nih.gov/
HMMER http://hmmer.janelia.org/
Goldenpath UCSC Browser (USA) http://genome.ucsc.edu/
ExPASy http://www.expasy.org/tools/
Broad Institute (USA) http://www.broadinstitute.org/
DOE-JGI Joint Genome Institute (USA) http://genome.jgi-psf.org/
Washington U. Genome Institute (USA) http://genome.wustl.edu/
Baylor College of Medicine (USA) http://www.hgsc.bcm.tmc.edu/
J. Craig Venter Institute (USA) http://www.jcvi.org/
Sanger (Ensembl) (UK) http://www.sanger.ac.uk/
EBI (European Bioinformatics Institute) http://www.ebi.ac.uk/
GigaDB (China) http://gigadb.org/
KEGG (Japan) http://www.genome.jp/kegg/
SFLD (Structure Function Linkage Database -Enzymes) http://sfld.rbvi.ucsf.edu/django/
PFAM (Protein Family Database domains, alignments, pHMM, trees) http://pfam.xfam.org
SMART (Simple Modular Architecture Research Tool, pHMM domains) http://smart.embl-heidelberg.de/
DiArk (Genome sequence download for all species ) http://www.diark.org/diark
CHIMERA (UCSF) https://www.cgl.ucsf.edu/chimera/
DOCK (UCSF), CLUSPRO http://dock.compbio.ucsf.edu/DOCK_6/index.htm
Unipro UGENE (HMMER, etc) http://ugene.unipro.ru/
aLeaves http://aleaves.cdb.riken.jp/aleaves/
FASTA (download package) http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml
CLUSTALO (EBI server) http://www.ebi.ac.uk/Tools/msa/clustalo/
Skylign logos (server) http://skylign.org/
Phylogenetic analysis (server) http://www.phylogeny.fr/version2_cgi/index.cgi
PHYLIP (phylogeny programs) http://evolution.gs.washington.edu/phylip/
RAxML (ML phylogeny server) http://phylobench.vital-it.ch/raxml-bb/
VISTA tools for comparative genomics http://genome.lbl.gov/vista/index.shtml
World Tour of Genomic Resources http://www.openhelix.com/cgi/tutorialInfo.cgi?id=119

- p-values directly
- no need for bootstrapping
Specificity Determining Positions (SDP) in the ANXA Family
Results: List of SDPs

 #    Alignment position   Amino acid in A10BT3   Mutual information (Ip)   Z-score (Zp)
 1    26                   27Cys                  9.38e-01                  9.26
 2    192                  192Cys                 1.23e+00                  9.14
 3    222                  222Tyr                 1.44e+00                  7.84
 4    145                  146Met                 1.32e+00                  7.47
 5    198                  198Gln                 5.17e-01                  7.24
 6    21                   22Leu                  1.03e+00                  7.08
 7    229                  229Leu                 1.29e+00                  6.86
 8    233                  233Thr                 7.57e-01                  6.53
 9    230                  230Leu                 9.69e-01                  6.44
 10   187                  187Leu                 7.04e-01                  6.24

Alignment column 26
Group     Prevalent residue   Residues in column
ANXA10    C                   CCCCCCCCCCCCCCCCCCCCCCCC
ANXA11    T                   TTTTTTTTTTTTTTTTTTTTTTTTTTT
ANXA13    T                   TTTTTTTTTTTTTTTTTTTTTTTTTTT
ANXA1     V                   VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
ANXA2     V                   VVVVVVVVVVVVVVVVVVVVVVVVVVVVVTVTVVVV
ANXA3     T                   TTTTTTTTTTTTTTTTTTTTT
ANXA4     T                   TTTTTTTTTTTTTTTTTTTTTTTTT
ANXA5     T                   TTTTTTTTTTTTTTTTTTTTTTTTTTTTT
ANXA6a    S                   SSSSSSSSSSSSSSSSS
ANXA6b    T                   TTTTTTTTTTTTTTTTTT
ANXA7     T                   TTTTTTTTTTTTTTTTTT

- Strong statistical basis
- Increase the separation between true and false positives
- Powerful modeling tool, versatile input of training data
- Modular for combining HMMs, library builds
- Generation of sequence logo makes data visually informative
- Incorporates prior knowledge (from MSA), architecture can restrain training
- Fast, now comparable to BLAST searches

pHMM versatility:
- Searches aa-nt
- MSA realignment
- Motif/domain ID
- Sequence logos

- Markov chains supposed to be independent but 1 site may affect neighboring site
- Convergence to true optimum may require a minimal number of observations (e.g. >20)
- Input: training set selection and quality of MSA determine accuracy and utility.

# HMMSEARCH :: search profile(s) against a sequence database
# HMMER 3.1b1 (May 2013); http://hmmer.org/
#------------------------------------
# query HMM file:                  /cygdrive/d/seq/anxhmm/ANXFDOM84.hmm
# target sequence database:        AnxBacteria139.lib
# output directed to file:         AnxBacteria139_ANXFDOM84_out.TXT
# sequence reporting threshold:    E-value <= 100

Query: ANXFDOM84 [M=68]
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence                          Description
    ------- ------ -----    ------- ------ -----   ---- --  --------                          -----------
    1.4e-67  216.5  30.7    4.9e-24   77.4   0.5    6.4  5  gi|380728269|gb|AFE04271.1|       hypothetical protein COCOR_01751
    4.9e-65  208.4  23.1      2e-21   69.0   0.2    6.0  5  gi|441490447|gb|AGC47142.1|       hypothetical protein MYSTI_05866
    1.4e-51  165.3  16.0    4.9e-20   64.6   1.3    6.7  5  gi|521998093|ref|WP_020509364.1|  hypothetical protein [Actinoplan
    3.4e-40  129.0  26.7    4.5e-16   51.9   0.1    7.0  5  gi|380729140|gb|AFE05142.1|       hypothetical protein COCOR_03294
    2.7e-27   87.8   7.9    3.5e-16   52.3   0.1    3.2  2  gi|262078316|gb|ACY14285.1|       Annexin repeat protein [Haliangi
    6.6e-21   67.4   0.4    1.4e-13   44.0   0.0    2.3  2  gi|406982961|gb|EKE04220.1|       hypothetical protein ACD_20C0009
    1.3e-20   66.4   0.9    5.7e-19   61.2   0.2    2.6  2  gi|262077805|gb|ACY13774.1|       Annexin repeat protein [Haliangi
    3.7e-20   65.0   0.6    5.3e-18   58.1   0.2    3.2  2  gi|262079386|gb|ACY15355.1|       hypothetical protein Hoch_2831 [
    3.9e-19   61.7   0.4    1.2e-18   60.2   0.4    1.9  1  gi|262079485|gb|ACY15454.1|       Annexin repeat protein [Haliangi
    9.3e-18   57.3   0.1    1.6e-17   56.5   0.1    1.4  1  gi|516340685|ref|WP_017730718.1|  hypothetical protein [Nafulsella
    2.4e-16   52.8   0.4    5.4e-16   51.7   0.1    1.7  1  gi|497868487|ref|WP_010182643.1|  hypothetical protein [Aquimarina
    2.1e-14   46.6   0.1    4.3e-14   45.6   0.0    1.6  1  gi|497867219|ref|WP_010181375.1|  hypothetical protein [Aquimarina
    4.1e-10   32.9   0.0    7.2e-10   32.1   0.0    1.4  1  gi|502689809|ref|WP_012925298.1|  hypothetical protein [Spirosoma
    1.4e-07   24.8   0.0      1e-06   22.0   0.0    2.4  2  gi|498349864|ref|WP_010664020.1|  hypothetical protein [Marinilabi
    7.2e-07   22.5   0.0    8.6e-07   22.3   0.0    1.2  1  gi|496742354|ref|WP_009358655.1|  hypothetical protein [Arthrobact
      5e-06   19.8   0.0    9.8e-06   18.9   0.0    1.5  1  gi|490437607|ref|WP_004308637.1|  beta-N-acetylhexosaminidase [Bac
    3.8e-05   17.0   0.0    6.5e-05   16.3   0.0    1.4  1  gi|507053948|ref|WP_016124903.1|  hypothetical protein [Bacillus c
    4.5e-05   16.8   0.0    7.4e-05   16.1   0.0    1.4  1  gi|493827167|ref|WP_006774594.1|  transcriptional regulator [Clost
    0.00042   13.7   0.0    0.00072   12.9   0.0    1.4  1  gi|507064142|ref|WP_016134952.1|  hypothetical protein [Bacillus c

Domain annotation for each sequence (and alignments):
>> gi|380728269|gb|AFE04271.1| hypothetical protein COCOR_01751 [Corallococcus coralloides DSM 2259]
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   77.4   0.5   2.4e-24   4.9e-24       5      67 ..     110     172 ..     106     173 .. 0.95
   2 !   23.8   0.0   1.4e-07   2.9e-07      13      67 ..     192     245 ..     180     246 .. 0.88
   3 !   35.8   1.2   2.5e-11   5.1e-11       3      67 ..     274     332 ..     272     333 .. 0.94
   4 !   43.1   0.6   1.3e-13   2.6e-13       2      68 .]     340     406 ..     339     406 .. 0.94
   5 !   55.0   1.4   2.4e-17     5e-17       4      63 ..     422     478 ..     419     483 .. 0.83

Alignments for each domain:
== domain 1 score: 77.4 bits; conditional E-value: 2.4e-24
                    ANXFDOM84   5 klweavdglgtdEdavlkvlrgltpeqiaavakaYqkrYgkdlgddlkselsgdelkralell 67
                                  +++++++g+gtdEd+++k+l+g+tpeqia+++++Yq++Ygk+l +++++el g++l+ra ll
  gi|380728269|gb|AFE04271.1| 110 AMDGGMTGWGTDEDKIFKTLEGKTPEQIAMIRQSYQDHYGKNLDEKIRDELGGSDLQRAEGLL 172
                                  6889*******************************************************9987 PP

== domain 2 score: 23.8 bits; conditional E-value: 1.4e-07
                    ANXFDOM84  13 lgtdEdavlkvlrgltpeqiaavakaYqkrYgkdlgddlkselsgdelkralell 67
                                  +g +Ed +lk+l++++p++++a+a+ Y +++g++ + +++ ++++r+++++
  gi|380728269|gb|AFE04271.1| 192 FGSNED-MLKILEKRSPAERHAIAQQYADMNGGTPAGQKPEDVLLARMGREMDGA 245
                                  689999.********************************************9987 PP
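
For ranking or filtering, the per-sequence score lines of a report like the one above can be read into records. The sketch below is a hedged illustration: the `Hit` type and `parse_hit_line` are hypothetical names, and the parser assumes each hit sits on a single line with the nine leading fields shown in the report (full-sequence E-value, score, bias; best-domain E-value, score, bias; exp; N; sequence name), followed by an optional description.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    full_evalue: float
    full_score: float
    full_bias: float
    dom_evalue: float
    dom_score: float
    dom_bias: float
    exp_domains: float
    n_domains: int
    sequence: str
    description: str


def parse_hit_line(line):
    """Parse one line of the 'Scores for complete sequences' table."""
    fields = line.split(None, 9)  # at most 10 fields; the last keeps its spaces
    if len(fields) < 9:
        raise ValueError("not a hit line: " + line)
    description = fields[9].strip() if len(fields) > 9 else ""
    return Hit(float(fields[0]), float(fields[1]), float(fields[2]),
               float(fields[3]), float(fields[4]), float(fields[5]),
               float(fields[6]), int(fields[7]), fields[8], description)


line = ("1.4e-67 216.5 30.7 4.9e-24 77.4 0.5 6.4 5 "
        "gi|380728269|gb|AFE04271.1| hypothetical protein COCOR_01751")
hit = parse_hit_line(line)
significant = hit.full_evalue < 1e-5   # example threshold, an assumption
```

For routine work, hmmsearch can also write a tabular summary with its `--tblout` option, which is generally easier to parse than the human-readable report.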
Phylogeny Estimation
Traditional approaches
- Neighbour-joining algorithm (distance scores)
- Tree searches with optimality criterion
  - Parsimony
  - Maximum likelihood (all sites and sequences) http://atgc.lirmm.fr/phyml/
- Bayesian approaches (posterior probability model)
i.e. choose different models of nucleotide substitution even if you don't know your data composition or the patterns of evolution, or use MODELTEST or ProtTest to help you select analysis parameters and select the most consistent, coherent tree!
TWO CRUCIAL PARAMETERS:
- GAMMA RATE distribution for site classification
- BOOTSTRAPPING algorithm as test of confidence

Data collecting, the first step
Typically, a few outgroup sequences are included to root the tree.
Insertions and deletions obscure which of the sites are homologous.
Multiple-sequence alignment is the process of adding gaps to a matrix of data so that the nucleotides in one column of the matrix are related to each other by descent from a common ancestral residue.

In addition to the data, the scientist must choose a model of sequence evolution.
Increasing model complexity improves the fit to the data but also increases variance in estimated parameters.
Model selection strategies attempt to find the appropriate level of complexity on the basis of the available data.
Model complexity can often lead to computational intractability.
Newer models: JTT, WAG

Model         Free parameters
GTR           8
TN93          5
HKY85, F84    4
F81           3
K80           1
JC69          0
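
As a concrete instance of the simplest model in the table, JC69 (no free parameters) corrects the observed proportion of differing sites $p$ for multiple hits with $d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$. The sketch below applies this to two short illustrative sequences; the function names are hypothetical, and gaps or ambiguous bases are not handled.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def jc69_distance(seq_a, seq_b):
    """Expected substitutions per site under the Jukes-Cantor (JC69) model."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Two illustrative aligned sequences differing at 3 of 18 sites.
a = "GGAGCCATATTAGATAGA"
b = "GGAGCAATTTTTGATAGA"
p = p_distance(a, b)
d = jc69_distance(a, b)
```

The corrected distance `d` is always at least as large as `p`, because some substitutions are hidden by later changes at the same site. Richer models in the table apply the same idea with unequal base frequencies and substitution rates.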
Methods of Reconstruction   Type of Data
                            Distances           Discrete characters
Clustering algorithms       UPGMA
                            Neighbor-Joining
Optimization criterion      Minimum Evolution   Maximum Parsimony
                                                Maximum Likelihood
                                                Bayesian

Traditional approaches
- Neighbour-joining algorithm
- Tree searches with optimality criterion
  - Maximum parsimony
  - Maximum likelihood
- Bayesian approaches

General Time-Reversible Substitution Models
[Figure: hierarchy of nested substitution models: GTR, TrN, SYM, HKY85, K3ST, F81, K2P, JC. Restrictions shown between models: equal base frequencies; 3 substitution types (transversions, 2 transitions); 2 substitution types (transversions vs transitions); 1 substitution type]

Parts of a phylogenetic tree
[Figure: tree with labels: branch, node, root, ingroup, outgroup]

 Gly Ala Ile Leu Asp Arg
-GGAGCCATATTAGATAGA-
-GGAGCAATTTTTGATAGA-
 Gly Ala Ile Phe Asp Arg
3 different DNA positions but only one different amino acid position: 2 nucleotide substitutions are therefore synonymous and one is non-synonymous.

DNA yields more phylogenetic information than proteins.
The nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes alter the DNA sequence but do not affect the amino acid sequence.
However, amino-acid sequences are more efficiently ...

Homologs = Genes of common origin
Orthologs = 1. Genes resulting from a speciation event, 2. Genes originating from an ancestral gene in the last common ancestor of the compared genomes
Co-orthologs = Orthologs that have undergone lineage-specific gene duplications subsequent to a particular speciation event
Paralogs = Genes resulting from gene duplication
Inparalogs = Paralogs resulting from lineage-specific duplication(s) subsequent to a particular speciation event
Outparalogs = Paralogs resulting from gene duplication(s) preceding a particular speciation event
One-to-one (1:1) orthologs = Orthologs with no (known) lineage-specific gene duplications subsequent to a particular speciation event
One-to-many (1:n) orthologs = Orthologs of which at least one - and at most all but one - has undergone lineage-specific gene duplication subsequent to a particular speciation event
Many-to-many (n:n) orthologs = Orthologs which have undergone lineage-specific gene duplications subsequent to a particular speciation event
Pseudo-orthologs = Paralogs with lineage-specific gene loss of orthologs
Xenologs = Orthologs derived by horizontal gene transfer from another lineage.
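
The synonymous versus non-synonymous count for the two 18-nucleotide sequences of this example can be checked with a short sketch. It builds the standard genetic code from the usual compact TCAG-ordered string and compares the sequences codon by codon; the function names are hypothetical, and a codon is counted once even if it differs at more than one position (each changed codon here has a single substitution).

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order for each position.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def classify_codon_changes(dna_a, dna_b):
    """Return (synonymous, non_synonymous) counts of differing codons."""
    if len(dna_a) != len(dna_b) or len(dna_a) % 3:
        raise ValueError("sequences must be aligned, in frame, and equal length")
    synonymous = non_synonymous = 0
    for i in range(0, len(dna_a), 3):
        codon_a, codon_b = dna_a[i:i + 3], dna_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


seq1 = "GGAGCCATATTAGATAGA"
seq2 = "GGAGCAATTTTTGATAGA"
counts = classify_codon_changes(seq1, seq2)
```

For these sequences `translate` gives `GAILDR` and `GAIFDR`, and `counts` is `(2, 1)`: two synonymous changes (Ala, Ile) and one non-synonymous change (Leu to Phe).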
[Figure: gene tree of an ancestral gene with a gene duplication: frog, human and mouse gene 1; mouse, human and frog gene 2; Drosophila gene. Labels: orthologs, paralogs, homologs]

Node: a branchpoint in a tree (a presumed ancestral OTU)
Branch: defines the relationship between the taxa in terms of descent and ancestry
Topology: the branching patterns of the tree
Branch length (scaled trees only): represents the number of changes that have occurred in the branch
Root: the common ancestor of all taxa
Clade: a group of two or more taxa or DNA sequences that includes both their common ancestor and all their descendants
Operational Taxonomic Unit (OTU): taxonomic level of sampling selected by the user to be used in a study, such as individuals, populations, species, genera, or bacterial strains

Use homologies, not analogies!
- Homology: common ancestry of two or more character states
- Analogy: similarity of character states not due to shared ancestry
- Homoplasy: a collection of phenomena that leads to similarities in character states for reasons other than inheritance from a common ancestor (e.g. convergence, parallelism, reversal)
[Figure: rooted tree of Species A to E with labels: branch, node, clade, root]

How to root?
Using outgroups
- the outgroup should be a taxon known to be less closely related to the rest of the taxa (ingroups)
- it should ideally be as closely related as possible to the rest of the taxa while still satisfying the above condition
[Figure: unrooted tree of taxa A, B, C, D with branches numbered 1 to 5 and an outgroup]

Cladogram
Phylogram
The branch length represents the number of character changes and, with calibration, the rate of mutation.
Molecular clocks require calibration.
Use MEGA for distance-based & maximum likelihood trees
Phylogenetic tree options describe your assumed model of evolution.
They correct for the aa replacement model, site-specific rates (gamma), gap indels, and statistical evaluation of confidence.
The resulting tree(s) should reflect the known species/gene bifurcation order (topology), the amount of evolution for each taxon (branch length), and the branching confidence (based on bootstrap pseudo-alignments).

Phylogenetic trees
Number of possible trees for n taxa:
rooted: $(2n-3)!/(2^{n-2}(n-2)!)$
unrooted: $(2n-5)!/(2^{n-3}(n-3)!)$

Taxa (n)   Rooted        Unrooted
2          1             1
3          3             1
4          15            3
5          105           15
6          945           105
7          10,395        945
8          135,135       10,395
9          2,027,025     135,135
10         34,459,425    2,027,025
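
The tree counts can be reproduced directly from the two formulas. This is a minimal sketch with hypothetical function names; the unrooted formula is only defined for n of 3 or more, so the single unrooted tree for 2 taxa is returned as a special case.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees for n taxa: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n taxa: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    if n == 2:
        return 1  # special case: the formula needs n >= 3
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


table = [(n, rooted_trees(n), unrooted_trees(n)) for n in range(2, 11)]
```

The number of unrooted trees for n taxa equals the number of rooted trees for n-1 taxa, which is why the two columns are shifted copies of each other. The rapid growth is the reason exhaustive tree searches are only feasible for small data sets.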

The Reverend Thomas Bayes 1701-1761
Bayesian Inference: the explanation with the highest posterior probability

1. Definition of conditional probability
$\Pr(A \text{ and } B) = \Pr(A)\,\Pr(B \mid A) = \Pr(B)\,\Pr(A \mid B)$
2. Bayes Theorem
$\Pr(H \mid D) = \dfrac{\Pr(H)\,\Pr(D \mid H)}{\Pr(D)}$
- $\Pr(H)$: prior probability, the probability of the hypothesis on previous knowledge
- $\Pr(D \mid H)$: likelihood function, the probability of the data given the hypothesis
- $\Pr(H \mid D)$: posterior probability, the probability of the hypothesis given the data
- $\Pr(D)$: unconditional probability of the data, a normalizing constant ensuring the posterior probabilities sum to 1.00

Bayesian phylogenetics
In practice, four tricks are used to make this possible:
- Markov chains (MC)
- Monte-Carlo (MC)
- Metropolis-coupled (MC)
These are models of complex behaviours of systems, where the output of one state of the system is independent of the previous experience of the system, and can be described by a series of emission probabilities (per aligned site rates of change, relative rates of base substitution)
- used for estimating likelihoods
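
A small numerical sketch may help fix the roles of the four terms in Bayes' theorem. The three tree hypotheses, their equal priors and their likelihoods below are hypothetical numbers chosen for illustration; exact fractions are used so that the normalization is visible without rounding.

```python
from fractions import Fraction


def posteriors(priors, likelihoods):
    """Apply Bayes' theorem to a finite set of competing hypotheses.

    priors[h] = Pr(H), likelihoods[h] = Pr(D | H).
    Pr(D) is obtained by summing Pr(H) * Pr(D | H) over all hypotheses.
    """
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    if evidence == 0:
        raise ValueError("the data have zero probability under every hypothesis")
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Hypothetical example: three candidate trees with equal prior probability.
priors = {"tree1": Fraction(1, 3), "tree2": Fraction(1, 3), "tree3": Fraction(1, 3)}
likelihoods = {"tree1": Fraction(6, 1000), "tree2": Fraction(3, 1000),
               "tree3": Fraction(1, 1000)}
post = posteriors(priors, likelihoods)
```

Here the posteriors are 3/5, 3/10 and 1/10. With real data the number of trees is far too large to enumerate, so Pr(D) cannot be summed directly; that is the problem the Markov chain Monte Carlo sampling addresses.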
Bayesian approaches
Bayesian approaches to phylogenetics are relatively new, but they are already generating a great deal of excitement because the primary analysis produces a tree estimate and measures of uncertainty for the groups on the tree.
The field of Bayesian statistics is closely allied with ML.
Maximum likelihood vs. Bayesian estimation
- Maximum likelihood: search for the tree that maximizes the chance of seeing the data (P (Data | Tree))
- Bayesian inference: search for the tree that maximizes the chance of seeing the tree given the data (P (Tree | Data))

Bayesian phylogenetics
MrBayes 3.1 or Phylobayes 3.3 or Beast 1.8
Bayesian approach
- Iterative process leading to improvement of trees and model parameters and that will provide the most probable trees (and parameter values)
- Complex models for amino acid changes: PAM and JTT, WAG (with correction for amino acid frequencies, but you have to type it !?!?!)
- Correction for rate heterogeneity between sites (pinv, discrete gamma, site specific rates)
- Powerful parameter space search
  - Tree space (tree topologies)
  - Shape parameter (alpha shape parameter, pinv)
- Can work with large datasets
- Provides probabilities of support for clades (posterior probabilities)

MrBayes 3.1
MrBayes 3.1: some options
MrBayes will produce a population of trees and parameter values, obtained by a Markov chain (mcmcmc). If the chain is working well these will have converged to probable values.
In practice we plot the results of an mcmcmc to determine the region of the chain that converged to probable values. The burn-in is the region of the mcmcmc that is ignored for calculation of the consensus tree.
Trees and parameter values from the region of equilibrium are used to estimate a consensus tree.
The proportion of trees recovering a given clade corresponds to the posterior for that clade, the probability that this clade exists.
The mcmcmc uses the lnL (log likelihood) function to compare trees between generations.
Support values for a given dataset and method: "posteriors" are typically higher than bootstrap and puzzle support values.
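The burn-in and clade-posterior ideas above can be sketched in a few lines. This is not MrBayes code: each sampled tree is represented here, as an assumption, by the set of clades it contains (each clade a frozenset of taxon names), and the toy chain is invented.

```python
from collections import Counter

def clade_posteriors(sampled_trees, burn_in):
    """Fraction of post-burn-in trees that contain each clade."""
    kept = sampled_trees[burn_in:]  # discard the burn-in region of the chain
    if not kept:
        raise ValueError("burn-in removes every sample")
    counts = Counter()
    for tree in kept:
        counts.update(set(tree))  # count each clade once per tree
    return {clade: n / len(kept) for clade, n in counts.items()}

# Hypothetical chain over taxa A-D: 2 early samples, then 6 + 2 at equilibrium
AB, CD = frozenset("AB"), frozenset("CD")
AC, BD = frozenset("AC"), frozenset("BD")
chain = [[AC, BD]] * 2 + [[AB, CD]] * 6 + [[AC, BD]] * 2

support = clade_posteriors(chain, burn_in=2)
print(support[AB])  # share of retained trees that contain clade (A,B)
```

In this toy chain six of the eight retained trees contain clade (A,B), so its estimated posterior is 0.75.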
Parsimony
In contrast to distance-based approaches, parsimony and ML map the history of gene sequences onto a tree.
In parsimony, the score is simply the minimum number of mutations that could possibly produce the data.
Parsimony has a few obvious disadvantages.
The score of a tree is completely determined by the minimum number of mutations among all of the reconstructions of ancestral sequences.
Another serious drawback of parsimony arises because it fails to account for the fact that the number of changes is unlikely to be equal on all branches in the tree.

Selection of (statistically) best-fit models of evolution
ProtTest3 http://darwin.uvigo.es/software/prottest3/
AIC (Akaike Information Criterion): simple relationship between the likelihood and the number of parameters to estimate the distance of a model from truth.
BIC (Bayesian Information Criterion): includes a penalty for the number of parameters to avoid overfitting of the selected model.
IQtree http://www.cibiv.at/software/iqtree/ (64-bit only)
Auto-detection of best model (AIC & BIC)
Fast Maximum Likelihood analysis of large bootstrap pseudoalignments.
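The two information criteria named in the model-selection notes (AIC and BIC) have simple closed forms: AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL, where k is the number of free parameters and n the number of alignment sites. The sketch below is not ProtTest or IQ-TREE code, and the model names, lnL values and parameter counts are invented for illustration.

```python
import math

def aic(lnL, k):
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2 * k - 2 * lnL

def bic(lnL, k, n_sites):
    """Bayesian Information Criterion: k ln(n) - 2 lnL (lower is better)."""
    return k * math.log(n_sites) - 2 * lnL

# Hypothetical fits of three models to one alignment of 300 sites:
# name -> (log-likelihood, number of free parameters)
fits = {
    "JTT": (-5230.0, 37),
    "WAG": (-5215.0, 37),
    "WAG+G": (-5150.0, 38),
}
n_sites = 300

best_aic = min(fits, key=lambda m: aic(*fits[m]))
best_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
print(best_aic, best_bic)
```

Because ln(n) exceeds 2 for any alignment longer than seven sites, BIC penalizes each extra parameter more heavily than AIC and tends to prefer simpler models.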

Neighbour-joining algorithm
extremely popular, relatively fast.
performs well when the divergence between sequences is low.
good when evolutionary rates vary. Proven to construct the correct tree when the pairwise distances are additive (fit a tree exactly).
The first step in the algorithm is converting the DNA or protein sequences into a distance matrix that represents the evolutionary distance between sequences.

           1         2         3         4         5
1 H959     -
2 H3847    0.00752   -
3 H5539    0.00809   0.01069   -
4 H1067    0.00681   0.01593   0.01547   -
5 H3368    0.00855   0.01126   0.01706   0.01505   -

A serious weakness of distance methods is that the observed differences between sequences are not accurate reflections of the evolutionary distances between them.
Multiple substitutions (saturation of changes over time)

Maximum likelihood
In ML, a hypothesis is judged by how well it predicts the observed data; the tree that has the highest probability of producing the observed sequences is preferred.
To use this approach, we must be able to calculate the probability of a data set given a phylogenetic tree.
This requires a model of sequence evolution that describes the relative probability of various events.
These probabilities take into account the possibility of unseen events.
From many perspectives, ML is the most appealing way to estimate phylogenies. All possible mutational pathways that are compatible with the data are considered and the likelihood function is known to be a consistent and powerful basis for statistical inference in general.
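Both points above (observed differences underestimate the true distance, and ML scores a hypothesis by the probability of the data under a model) can be seen with two sequences and the Jukes-Cantor (JC69) model. JC69 is chosen here only as the simplest nucleotide model, not because the slides prescribe it; the site counts and function names are invented for illustration.

```python
import math

def jc69_distance(p):
    """JC69-corrected distance from the observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75); beyond that the signal is saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc69_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of a pairwise alignment at distance d (substitutions per site)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability a base is unchanged after distance d
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    # 0.25 is the equilibrium frequency of the base in the first sequence
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

def ml_distance(n_same, n_diff, step=1e-4, max_d=3.0):
    """Grid search for the distance that maximizes the JC69 log-likelihood."""
    grid = (i * step for i in range(1, int(max_d / step) + 1))
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_same, n_diff))

# Hypothetical alignment: 90 identical sites, 10 differing sites (p = 0.10)
print(jc69_distance(0.10), ml_distance(90, 10))
```

The corrected distance is larger than the observed 0.10 because some sites have changed more than once, and the grid-search ML estimate agrees with the closed-form correction to within the grid step.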
Principle
Calculates likelihoods for each position in the alignment and for all possible topologies (gaps generally removed)
Result = tree with the highest (log) likelihood
Maximizes the likelihood of observing the sequence data for a specific model of character state changes
Maximized to estimate branch lengths, not topologies
Search strategies: rarely exhaustive, mostly heuristic
NNI (Nearest neighbor interchanges)
TBR (Tree bisection-reconnection)
SPR (Subtree pruning and regrafting)

Alpha parameter, scaling factor
Infinitely large alpha value, rate variation is the same for all sites
alpha = 1, extensive rate variation
alpha < 1, many invariable sites
[Figure: gamma distribution; x-axis: relative evolutionary rate, y-axis: probability density. http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Gamma_distribution_pdf.png]
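The effect of the alpha shape parameter can be explored by drawing relative site rates from a gamma distribution with mean 1 (shape alpha, scale 1/alpha). The sketch below uses only Python's standard library; the 0.1 threshold for a "nearly invariable" site, the seed and the function names are arbitrary choices for illustration.

```python
import random

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative rates from Gamma(shape=alpha, scale=1/alpha), which has mean 1."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def summarize(rates, slow_threshold=0.1):
    """Return (mean rate, fraction of sites slower than slow_threshold)."""
    mean = sum(rates) / len(rates)
    slow = sum(r < slow_threshold for r in rates) / len(rates)
    return mean, slow

for alpha in (0.5, 1.0, 5.0, 50.0):
    mean, slow = summarize(sample_site_rates(alpha, 20000))
    print(f"alpha={alpha:5.1f}  mean rate={mean:.2f}  nearly invariable={slow:.3f}")
```

Small alpha values put many sites close to a rate of zero with a few very fast sites, while large alpha values concentrate every site near the mean rate of 1.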

approximate Likelihood-Ratio Test (aLRT) for ML
is a statistical test to compute branch supports: it uses the likelihood score of a branch to calculate the approximate probability that a particular branch really exists in the true tree. It is much faster than bootstrapping. http://www.phylogeny.fr
1. aLRT
2. Chi2: parametric branch support
3. aLRT-SH: non-parametric branch support based on a Shimodaira-Hasegawa-like procedure
4. aLRT Chi2 and SH: calculates parametric and non-parametric branch support; result is the minimum support of both methods

Bayesian approaches
Bayesian approaches to phylogenetics are relatively new, but they are already generating a great deal of excitement because the primary analysis produces a tree estimate and measures of uncertainty for the groups on the tree.
Very time-intensive
Programs: Ugene-MrBayes; PhyloBayes (download, run parallel chains)
[Figure: prior distribution and posterior distribution of probability (scale up to 1.0) across tree topologies 1, 2 and 3, before and after the data (observations)]
[Figure: Metropolis-coupled MCMC, (MC)3, showing the swap of states between chains. Credit: John Huelsenbeck]
Bootstrapping and MCMC
Bootstrapping and MCMC generate a sample of trees
Note that MCMC yields a much larger sample of trees in the same computational time, because it produces one tree for every proposal cycle versus one tree per tree search in the traditional approach.
However, the sample of trees produced by MCMC is highly auto-correlated.
As a result, millions of cycles through MCMC are usually required, whereas many fewer (of the order of 1,000) bootstrap replicates are sufficient for most problems.
Bayesian methods are exciting because they allow complex models of sequence evolution to be implemented, for example:
estimating divergence times
finding the residues that are important to natural selection
detecting recombination points
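For the bootstrap side of this comparison, each replicate is a pseudo-alignment built by resampling the alignment columns with replacement, and one tree search is then run per replicate. The sketch below shows only the resampling step; the toy alignment and function name are invented, and the tree search itself is left to a program such as MEGA or IQ-TREE.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement to make one pseudo-alignment."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("sequences must be aligned (equal length)")
    columns = [rng.randrange(length) for _ in range(length)]
    # Use the same sampled columns for every sequence so sites stay aligned
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}

# Hypothetical toy alignment
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGAAC"}
rng = random.Random(1)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
print(replicates[0])
```

Bootstrap support for a clade is then the fraction of replicate trees that contain it, which is why it is counted in the same way as a clade posterior but usually gives lower values.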

Evolutionary Divergence of Vertebrate Annexins
[Figure: "ANXA" family tree (scale bar 0.2, support values at nodes) shown next to the annexin domain ligands of each subfamily (ANX repeats, T2-CBS, K/H/RGD and KGD motifs: overlapping, adjacent, replaced or dysfunctional; Myr on ANXA13). Subfamilies with number of species: ANXA9 (18), ANXA2 (49), ANXA1 (64), ANX3'A6 (36), ANXA10 (13), ANX5'A6 (36), ANXA5 (38), ANXA8 (11), ANXA3 (35), ANXA4 (32), ANXA11 (45), ANXA7 (27), ANXA13 (35).]
[Figure: ANXA1 phylogeny (112 vertebrates), scale bar 0.2 aa repl/site. Tip labels run from ANXA1 Homo sapiens (human) to ANXA1 Petromyzon marinus (sea lamprey), with clade brackets: Primates, Eutherian mammals, Metatherian mammals, Prototherian mammals, Aves (birds), Lepidosauria, Amphibia, Dipnoi (lungfishes), ANXA1b Actinopterygii (ray-fin fishes), ANXA1a Actinopterygii (ray-fin fishes), Chondrostei (bony fishes), Chondrichthyes (cartilaginous fishes), Hyperoartia (jawless fishes).]
[Figure: Annexin origins. Panel labels: FAMILY B - Animals (Amebozoa, Choanoflagellida, Ctenophora, Porifera); FAMILY C (Fungi; Oomycetes); FAMILY D; FAMILY E; FAMILY F; Unicellular Organisms; Rhizaria (Gymnophrys cometa, Reticulomyxa filosa); Chlorophyta; Apusozoa (Thecamonas trahens); Proteobacteria; Protista.]

REQUIREMENTS: Valid homologs, correctly aligned.

TRADITIONAL APPROACHES:
Neighbour-joining algorithm (distance scores)
Parsimony - Tree searches with optimality criterion
Maximum likelihood (pairwise comparison of all sites in all sequences) http://atgc.lirmm.fr/phyml/
Bayesian approaches (posterior probability model)
i.e. Choose an appropriate model of nucleotide substitution if you know your data composition and its patterns of evolution; or use MODELTEST or PROTTEST (http://darwin.uvigo.es/software/prottest2_server.html) to help you select analysis parameters and select the most consistent, coherent tree!

TWO CRUCIAL PARAMETERS:
GAMMA RATE distribution for site classification
BOOTSTRAPPING algorithm as test of confidence

The molecular sequences (nt or aa) identified by name are sorted according to various criteria, depending on the phylogenetic program computational algorithm. All attempt to achieve the true duplication/speciation order, amounts and rates of evolution.
Node: a branchpoint in a tree (a presumed ancestral OTU)
Branch: defines the relationship between the taxa in terms of descent and ancestry
Topology: the branching patterns of the tree
Branch length (scaled trees only): represents the number of changes that have occurred in the branch
Root: the common ancestor of all taxa
Clade: a group of two or more taxa or DNA sequences that includes both their common ancestor and all their descendants
Operational Taxonomic Unit (OTU): taxonomic level of sampling selected by the user to be used in a study, such as individuals, populations, species, genera, or bacterial strains
[Figure: rooted tree of Species A to E, labelling Branch, Node, Root, Ingroup Clades and Outgroup Clade]

[Figure: an ancestral gene undergoes a gene duplication, giving Frog, Human and Mouse gene 1 and gene 2, plus a Drosophila gene. The same gene in different species (e.g. Frog gene 1 and Human gene 1) are Orthologs; Mouse gene 1 and Mouse gene 2 are Paralogs; all are Homologs]

Use homologies, not analogies!
- Homology: common ancestry of two or more character states
- Analogy: similarity of character states not due to shared ancestry
- Homoplasy: a collection of phenomena that leads to similarities in character states for reasons other than inheritance from a common ancestor (e.g. convergence, parallelism, reversal)

Phylogenetic trees
How to root? Using outgroups
[Figure: unrooted tree of taxa A, B, C, D with branches numbered 1 to 5; rooted tree labelling Branch, Node, Root, Ingroup and Outgroup]
- the outgroup should be a taxon known to be less closely related to the rest of the taxa (ingroups), e.g. earlier diverging
- it should ideally be as closely related as possible to the rest of the taxa while still satisfying the above condition, e.g. not too distant homolog(s).

Number of possible trees for n taxa
rooted:   (2n-3)! / (2^(n-2) (n-2)!)
unrooted: (2n-5)! / (2^(n-3) (n-3)!)

Taxa (n)   rooted        unrooted
2          1             1
3          3             1
4          15            3
5          105           15
6          945           105
7          10,395        945
8          135,135       10,395
9          2,027,025     135,135
10         34,459,425    2,027,025
Key residue prediction using subfamily and family-wide conservation analysis
[Figure: structure with labelled residues Y221, W222, D558, R627, D628, G629, Y743, A744, H745 and motif D RD E YAH. Sources: Parker JS, Roe SM, Barford D., EMBO J., 2004; Tanaka Hall, T., Structure 2005; Rivas et al, 2005]
Discover bacterial annexins by pHMM search, alignment, tree, 3D modeling, and analysis in Chimera

MAPT 3D structure (Predicted): exons, conservation, hydrophobicity, electric potential

Start by opening a XXXX.PDB file.
Select the sites you want to change using the Menu selection tool, Tools > Sequence, the command line (select :1-n), Ctrl+MouseLeftClick on the screen, etc.
Modify the selection by atom style, color, labels.
Save the session by name.
Export the image as a bitmap (or POV-Ray).

SPECIAL TRICKS:
1. Incorporate electric potential data, using the web server for the Adaptive Poisson-Boltzmann Solver (APBS): http://nbcr-222.ucsd.edu/pdb2pqr_2.0.0/
Enter a PDB code and select force-field parameters to process the .pdb into the files PDB.pqr, PDB.propka and PDB.in. Continue by launching APBS to calculate values from PDB.pqr into PDB-pot-PE0.dx(.gz) for use as input to Chimera.

2. Incorporate the residue conservation pattern from ConSurf: http://consurf.tau.ac.il/
Annexin 3D models to visualize molecular surface topology, hydrophobicity patches, electrostatic charge potential distribution, and ligand presentation.
[Figure panels: ANXA9 KGD+hydrophobicity; ANXA10 Cys+hydrophobicity; Cytophaga ANXF hydrophobicity; ANXA9 electrostatic potential; ANXA10 electrostatic potential; Cytophaga ANXF electrostatics; Haliangium ANXF2; Cytophaga annexin F1 147 aa]

Role of Evolution for a successful paradigm
"In Biology, nothing makes sense except in the light of Evolution" (Theodosius Dobzhansky, Russian geneticist)
"It is not necessary to accept everything as true, one must only accept it as necessary" (Franz Kafka, The Trial)
- Poster credits: Illumina, Genome Research, UCSC Web-browser

GROUP-PAIR TG# / GENE | PROTEIN | PHENOTYPE-DISEASE / INTEREST
ABCC9 (1549 aa), ABCC8 (1581 aa) | ATP-binding cassette, sub-family C (CFTR/MRP), member 9 | Long sleeper
MTR - miraculin (220 aa) | | Changes sour taste to sweet; food, diabetes
DMD (3685 aa), DMPK (629 aa) | Duchenne dystrophin; muscular dystrophy protein kinase | Muscular dystrophy, myotonic dystrophy (polyCTG in 3'UTR)
FAM63AB (469-621) | Unknown | Kidney?
FXN (210 aa) | Frataxin, Friedreich's ataxia | Mitochondrial iron transport, intronic polyGAA
GFP (238 aa) | Fluorescent marine proteins | Fluorescence (jellyfish, Aequorea victoria)
GULO (440 aa) | Vitamin C biosynthesis | Missing in some primates and bats
HEXA (529 aa) | hexosaminidase A (alpha polypeptide) | Tay-Sachs (mutations in the alpha subunit)
HTT (3142 aa) | Huntingtin | Huntington, neurodegeneration, polyGln (coding triplet)
LCT (1927 aa) | Lactase | Persistence / deficiency - Neandertal, Denisovan, Clovis
POMC (267 aa) | Enkephalin, Endorphin, Dynorphin | Opiates
HBB (147 aa) | beta hemoglobin, HbS, thalassemia, etc. | Variants in O2 affinity
CHRNA5 (468 aa) | Neuronal Acetylcholine Receptor, Nicotinic, Alpha 5 | Contribution to addiction
MAP-Tau (< 758 aa) | Neurodegeneration | Misfolded and aggregated protein in central neurons
STH (128 aa) | Saitohin | Gene within the MAPT gene
Taq-Pol (834 aa) | DNA polymerase I (polA) [Thermus thermophilus HB27] | Extremophile enzyme
Trefoil TFF1,2,3 (84, 129, 80 aa) | Trefoil factors, mucus | Role of the cysteines in the mucosa
KIF1B (1816 aa) | kinesin family member 1B | Charcot-Marie-Tooth (CMT2A1)
ANXA8, ANXA8L1, ANXA8L2 (327 aa) | Annexin subfamily, amplified only in humans | Origin of the duplications, ancestor, role of the genes
ANXA10 (324 aa) | Annexin subfamily | Overexpressed in the stomach mucosa
Mir-508, Alu, HAT/HDAC | MicroRNA, epigenetics | Origin, function, targets
[Images: Albert Einstein (the long sleeper, ABCC8+9); sweet MTR (grape); GFP; GULO; globins; crocodile icefish; aye-aye of Madagascar]

# | GENOMIC TOPIC | DISEASE OR PHENOTYPE (OMIM) | DISTRIBUTION (GEO, GeneCards) | SAMPLE REFERENCE (PubMed) | GENE(S) Focus (NCBI-Gene) | BIOINFORMATICS (homologs, align, pHMM, phylotree, 3D, SDP, DOCK)
1 | Epigenetic Imprinting | Angelman, Prader-Willi | Human | Lewis et al., 2014; PMID: 25378697 | SNRPN (AS, PWS) 240 aa | Imprinting center, regulatory signals
2 | Megaviruses from the 4th domain | Cellular targets | Pithovirus, Mimivirus, Pandoravirus | Legendre-2014 | Present and absent | Genome organization (maps) and phylogenetic relationships.
3 | MAPT locus | Neurodegenerative Diseases | Human | Itsara-2012 | KANSL1 MAPT-link | Microdeletion, duplication
4 | Triplet expansion | Huntington-coding CAG polyglutamine; Friedreich's ataxia (intronic GAA) | Human | Marmolino-2012 | HTT 3144 aa; FXN 210 aa | Etiology and consequences, alternative mechanisms
5 | Viral receptors | Invasion, proliferation | Human | Longdon-2014 | BST (tetherin) | Interaction specificity
6 | Melanocortin signalling | Hair color, melanoma | Vertebrates | | ASIP agouti 138 aa | Epigenetic basis
7 | Antiaging | Aging, cancer | Vertebrates | Yu-2014; PMID: 22073188 | EPCAM | Properties, role in skin
# | GENOMIC TOPIC | DISEASE OR PHENOTYPE (OMIM) | DISTRIBUTION (GEO, GeneCards) | SAMPLE REFERENCE (PubMed) | GENE(S) Focus (NCBI-Gene) | BIOINFORMATICS (homologs, align, pHMM, phylotree, 3D, SDP, DOCK)
8 | Siamese coloring | Melanin biosynthesis | Cats | Lyons et al., 2005 (PMID: 15771720) | TYR 377 aa | Tyrosinase aa conservation at catalytic sites, key color mutations.
9 | Antifreeze proteins | Cold adaptation | Antarctic fish notothenioid | Chen-1997 | AFP/AGFP or MAPT | Evolution; Physicochemistry
10 | Fluorescent fish proteins | Coloring | Marine species | Sparks-2014 | Fis vs GFP | Protein 3D structure and color mutations.
11 | Antibiotic defense peptides | Antibacterial | Vertebrates | Mechkarska-2014; PMID: 25086320 | Magainins LOC100489132 | Define active sites, mechanism
12 | Chocolate characteristic | Color & taste determinants | Theobroma cacao | Motamayor-2013 | | Functional profile of associated protein variants
13 | Lactase persistence | Lactose intolerance | Mammals | Wray et al., 2007 | LCT | European origin of regulatory SNP mutation for LCT in neighbor gene
14 | Repetitive elements | Deleterious transposon | Human | GIRI | Alu | Locus mapping, basis of transposition
15 | Human gene duplications | | Human | NCBI | e.g. ANXA8 | Locus mapping, functional consequences

GENOMIC TOPIC | DISEASE OR PHENOTYPE (OMIM) | DISTRIBUTION (GEO, GeneCards) | SAMPLE REFERENCE (PubMed) | GENE(S) Focus (NCBI-Gene) | BIOINFORMATICS (homologs, align, pHMM, phylotree, 3D, SDP, DOCK)
Smallest genome | Parasitic microsporidiosis | Encephalitozoon intestinalis | Corradi-2010 | 2000 genes | Analyze 1 protein family
Largest genome | Primitive angiosperm | Paris japonica | Pellicer-2010 | Repetitive elements? | Find protein in UniProt
RNA-BP (CLIP) | Neurodegeneration | Human | Hudson-2014; Pascale-2012 | ELAV-RBP | pHMM of protein-RNA interaction site
Epigenetics, histone | Development, aging, cancer | Human | Chen-2013; Ederveen et al., 2011 | Histone H3 variants | Functional differences
Genome therapy | X-linked lymphoproliferative disorder | Human | Speckmann-2013; Lukacs-2013 | BIRC4, XIAP 497 aa | Map mutation(s)
SUMOylation profiling | various | Human | Lamoliatte-2014 | SUMO (small ubiquitin-rel. modifiers) | pHMM, interactions

GENOMIC TOPIC | DISEASE OR PHENOTYPE (OMIM) | DISTRIBUTION (GEO, GeneCards) | SAMPLE REFERENCE (PubMed) | GENE(S) Focus (NCBI-Gene) | BIOINFORMATICS (homologs, align, pHMM, phylotree, 3D, SDP, DOCK)
DFTD | Immune evasive cancer | Tasmanian devil cancer | Siddle-2013 | ? |
Paraspeckles | Subnuclear interchromatin | Human | Fox-2010 | RNA-BP ANXA10 | Role, interactions
Ancient (extinct) humans: Saqqaq, Neanderthal & Denisovan | Unusual characteristics | Human | Rasmussen-2010 | MAPT H1-H2 haplotypes | Comparative genomics
Black plague | Cause of virulence | Yersinia pestis | | Comparative strains |
Spanish Flu | Cause of virulence | Influenza virus | Tumpey et al., 2005; Taubenberger-2012 | No seq | Sequence differences
lncRNAs | Development & Disease | Human | Bergmann-2014; Batista-2013 | Species comparison? | Mechanism, interactions
Genes, proteins | Neurodegenerative Diseases | | Stessman HAF, Nature Genet. 2017 | SCN2A, ARID1B, ADNP, CHD8, SYNGAP1, POGZ, DYRK1A, CTNNB1, ANKRD11, NAA15, MED13L, FOXP1 | Top 12 candidate genes causing NDD

Some Known Genetic Disorders
P  Point mutation, or any insertion/deletion entirely inside one gene
D  Deletion of a gene or genes
C  Whole chromosome extra, missing, or both (see chromosomal aberrations)
T  Trinucleotide repeat disorders: gene is extended in length
http://www.ncbi.nlm.nih.gov/omim/ http://en.wikipedia.org/wiki/List_of_genetic_disorders

SINGLE-GENE DISORDERS | # LIVEBORN INFANTS
AUTOSOMAL DOMINANT
Familial hypercholesterolemia | 1 in 500
Polycystic kidney disease | 1 in 1250
Neurofibromatosis Type I | 1 in 2,500
Hereditary spherocytosis | 1 in 5,000
Marfan syndrome | 1 in 4,000
Huntington's disease | 1 in 15,000
AUTOSOMAL RECESSIVE
Sickle cell anemia | 1 in 625 (African)
Cystic fibrosis | 1 in 2,000 (Caucasians)
Lysosomal Acid Lipase (LAL) Deficiency | 1 in 40,000
Tay-Sachs disease | 1 in 3,000 (Amer. Jews)
Phenylketonuria | 1 in 12,000
Mucopolysaccharidoses | 1 in 25,000
Glycogen storage diseases | 1 in 50,000
Galactosemia | 1 in 57,000
X-LINKED
Duchenne muscular dystrophy | 1 in 7,000
Hemophilia | 1 in 10,000

DISORDER | MUTATION | CHROMOSOME
22q11.2 deletion syndrome | D | 22q
Angelman syndrome | DCP | 15
Canavan disease | | 17p
Coeliac disease | |
Charcot-Marie-Tooth disease | |
Color blindness | P | X
Cri du chat | D | 5
Cystic fibrosis | P | 7q
Down syndrome | C | 21
Duchenne muscular dystrophy | D | Xp
Haemochromatosis | P | 6
Haemophilia | P | X
Klinefelter's syndrome | C | X
Neurofibromatosis | | 17q/22q/?
Phenylketonuria | P | 12q
Polycystic kidney disease | P | 16 (PKD1) or 4 (PKD2)
Prader-Willi syndrome | DC | 15
Sickle-cell disease | P | 11p
Tay-Sachs disease | P | 15
Turner syndrome | C | X
Some Trinucleotide Repeat Genetic Disorders

Polyglutamine (PolyQ) Diseases, with "CAG" codon extension (most common)

Type                                                      Gene                                     Normal PolyQ repeats   Pathogenic PolyQ repeats
DRPLA (Dentatorubropallidoluysian atrophy)                ATN1 or DRPLA                            6 - 35                 49 - 88
HD (Huntington's disease)                                 HTT (Huntingtin)                         10 - 35                35+
SBMA (Spinobulbar muscular atrophy or Kennedy disease)    Androgen receptor on the X chromosome    9 - 36                 38 - 62
SCA1 (Spinocerebellar ataxia Type 1)                      ATXN1                                    6 - 35                 49 - 88
SCA2 (Spinocerebellar ataxia Type 2)                      ATXN2                                    14 - 32                33 - 77
SCA3 (Spinocerebellar ataxia Type 3 or
      Machado-Joseph disease)                             ATXN3                                    12 - 40                55 - 86
SCA6 (Spinocerebellar ataxia Type 6)                      CACNA1A                                  4 - 18                 21 - 30
SCA7 (Spinocerebellar ataxia Type 7)                      ATXN7                                    7 - 17                 38 - 120
SCA17 (Spinocerebellar ataxia Type 17)                    TBP                                      25 - 42                47 - 63

Non-PolyQ Diseases

Type                                                      Gene                                     Codon             Normal/wildtype   Pathogenic
FRAXA (Fragile X syndrome)                                FMR1, on the X-chromosome                CGG               6 - 53            230+
FXTAS (Fragile X-associated tremor/ataxia syndrome)       FMR1, on the X-chromosome                CGG               6 - 53            55-200
FRAXE (Fragile XE mental retardation)                     AFF2 or FMR2, on the X-chromosome        GCC               6 - 35            200+
FRDA (Friedreich's ataxia)                                FXN or X25 (frataxin), 9q21 intron       GAA               7 - 34            100+
DM1 (Myotonic dystrophy)                                  DMPK                                     CTG               5 - 37            50+
SCA8 (Spinocerebellar ataxia Type 8)                      OSCA or SCA8                             CTG               16 - 37           110 - 250
SCA12 (Spinocerebellar ataxia Type 12)                    PPP2R2B or SCA12                         nnn (on 5' end)   7 - 28            66 - 78
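
The repeat ranges tabulated above can be checked against a sequence with a few lines of Python. The following is only a minimal sketch: the function names, the `POLYQ_RANGES` dictionary and the toy sequence are hypothetical, the three ranges are copied from the PolyQ table, and the code counts only perfect, uninterrupted copies of the repeat unit.

```python
import re

# Normal and pathogenic CAG-repeat ranges copied from the PolyQ table above:
# gene -> (normal_min, normal_max, pathogenic_min, pathogenic_max).
# Only three genes are included, purely for illustration.
POLYQ_RANGES = {
    "ATXN1": (6, 35, 49, 88),
    "ATXN3": (12, 40, 55, 86),
    "CACNA1A": (4, 18, 21, 30),
}


def longest_repeat_run(seq, unit="CAG"):
    """Return the largest number of consecutive, uninterrupted copies of `unit`."""
    unit = unit.upper()
    runs = re.findall(f"(?:{re.escape(unit)})+", seq.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


def classify_repeat_count(gene, count):
    """Place a repeat count in the tabulated normal or pathogenic range."""
    normal_min, normal_max, path_min, path_max = POLYQ_RANGES[gene]
    if normal_min <= count <= normal_max:
        return "normal"
    if path_min <= count <= path_max:
        return "pathogenic"
    return "outside tabulated ranges"


if __name__ == "__main__":
    # Toy sequence (not a real gene): a start codon, 52 CAG codons, a stop codon.
    toy = "ATG" + "CAG" * 52 + "TAA"
    n = longest_repeat_run(toy)
    print(n, classify_repeat_count("ATXN1", n))
```

Counts that fall between the two tabulated ranges are reported separately rather than forced into one class, because the table gives no interpretation for them.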

Enzymatic biosynthesis of vitamin C

Plants and yeasts
In mammals
Gulonolactone dehydrogenase EC 1.1.3.8 (cow gene)
(TBLASTn with protein NP_001029215, 440 aa)
Bos taurus Hereford chr. 8
Homo sapiens chr. 8

Cis-regulatory mutations with interesting phenotypic consequences*
Wray, GA. (2007). The evolutionary significance of cis-regulatory mutations. Nature Rev Genet 8: 206-216.

Gene      Function of product          Phenotype                                                  Taxon              References
AVPR1A    Vasopressin receptor         Creative dance performance                                 Humans             82
Avpr1a    Vasopressin receptor         Paternal care                                              Rodents            83
Cyp6G1    P450 enzyme                  Pesticide resistance                                       Fruitflies         84
DARC      Chemokine receptor           Resistance to infection with malaria                       Humans             58,59
e         Pigment synthesis            Colour pattern of abdomen                                  Fruitflies         25
hsp70     Heat shock protein           Thermal tolerance                                          Fruitflies         85,86
HTR2A     Serotonin receptor           Obsessive-compulsive behavior                              Humans             87
IL10      Interleukin                  Outcome of infection with HIV and infection with leprosy   Humans             88,89
IL10      Interleukin                  Susceptibility to schizophrenia                            Humans             90
LCT       Digestive enzyme             Lactose persistence                                        Humans             64,81
LDH       Metabolic enzyme             Cardiac physiology                                         Killifish          91
ovo/svb   Transcription factor         Bristle pattern on larvae                                  Fruitflies         34,36
MAOA      Neurotransmitter turnover    Aggressive behavior                                        Humans             92,93
MMP3      Matrix metalloprotease       Risk of heart disease                                      Humans             94,95
PDYN      Neuropeptide                 Memory, emotional status                                   Humans             71,76
pitx1     Transcription factor         Skeletal patterning                                        Stickleback fish   37,39,40
sc        Transcription factor         Bristle pattern on adult notum                             Fruitflies         35,96
SLC6A4    Serotonin transporter        Depression, creativity, anxiety                            Humans             82,97
SLC6A4    Serotonin transporter        Dispersal behavior                                         Macaques           98
tb        Transcription factor         Branching structure                                        Maize              99,100
Ubx       Transcription factor         Bristle pattern on adult legs                              Fruitflies         101
y         Pigment synthesis            Colour pattern of cuticle; mating behaviour                Fruitflies         25-28,102
Protein alignment of feline Tyrosinase (TYR) and other species.
The cat mutation consistent with the pointed phenotype is the G302R in exon 2 and
the mutation consistent with the Burmese pattern of gradient shading is the exon 1
G227W mutation.

CNV Copy number variation in HUMAN annexin A8

Bioinformatic Studies in Genome Biotechnology


Useful Webservers: SMART Protein Domains
http://smart.embl-heidelberg.de/
Known domains in annexins identified by SMART pHMM

Topic 1: Genomic imprinting through epigenetic silencing.
Genes involved: 15q11-13 genes, maternally expressed UBE3A, paternally expressed SNRPN and NDN.
Syndromes associated: Angelman syndrome (loss of maternal UBE3A expression).
Prader-Willi syndrome (loss of paternal SNRPN and/or NDN expression).

Information mining: Epigenetic targets and mechanisms; Imprinting center; Syndrome etiology, pathogenesis.

References: Lewis MW et al. (2014) PMID: 25378697

Bioinformatic analyses:
1. Run a BLAST search and retrieve ca. 100 diverse homologs of SNRPN (BLAST, HMMER). Compile a FASTA library file
(save GenBank annotations separately).
2. Align proteins (ClustalO, Muscle, Cobalt), manually revise (UGENE) and export in various formats (aln, phy,
meg, afa) from BUGACO.
3. Analyze the alignment to produce a pHMM (HMMER 3.1b) and create a sequence logo (Skylign). Interpret the
conservation pattern to infer site-specific functionality (SDP also).
4. Perform phylogenetic analyses (MEGA-NJ, RAxML, Bayesian) with bootstrapping and gamma rates to
confirm homology, distinguish orthologs from paralogs, estimate rates, and assess coherence with the known
speciation order.
5. Retrieve the 3D structure (RCSB PDB) to model in Chimera. Incorporate physicochemical and evolutionary
information
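
The conservation pattern that a sequence logo displays (step 3) can be approximated by hand. The sketch below is a simplified analogue, not Skylign's own calculation: it assumes a uniform 20-residue background, ignores gaps, and scores each alignment column as log2(20) minus its Shannon entropy. The function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20):
    """Return per-column information content (bits) for aligned sequences.

    Assumes a uniform background over `alphabet_size` residues; gap
    characters ('-' and '.') are skipped when counting residues.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    max_bits = math.log2(alphabet_size)
    info = []
    for col in range(length):
        residues = [seq[col].upper() for seq in alignment if seq[col] not in "-."]
        if not residues:
            info.append(0.0)  # a column of gaps carries no information
            continue
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total)
            for n in Counter(residues).values()
        )
        info.append(max_bits - entropy)
    return info


if __name__ == "__main__":
    toy_alignment = ["MKVL", "MRVL", "MK-I"]  # invented example, not SNRPN
    for position, bits in enumerate(column_information(toy_alignment), start=1):
        print(position, round(bits, 2))
```

Fully conserved columns reach the maximum of log2(20) bits, and variable columns score lower, which is the basis for reading candidate functional sites from a logo.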
Marine fluorescent proteins in invertebrates and vertebrates

Fluorescent glowfish begin to light up the oceans!

Encephalitozoon intestinalis has the smallest eukaryotic genome (2.25 Mbp), sequenced in 2010.
It is a parasite that can cause microsporidiosis.

LARGEST GENOMES
Pinus taeda (loblolly pine) 22.18 Gbp
Paris japonica (canopy lily) 150 Gbp

Genomes of Interest

Longevity of the naked mole rat (Heterocephalus glaber)

The genome of the (ugly, subterranean) naked mole rat Heterocephalus glaber has been
sequenced (Kim et al., 2011; Keane et al., 2014).
It is a thermoconformer (body temperature
tracks ambient temperatures), lacks pain
sensation in its skin, and has very low metabolic
and respiratory rates. It is also remarkable for its
resistance to cancer and its longevity.

Hutchinson-Gilford progeria (accelerated aging) is caused by a mutation in the
LMNA gene, whose protein provides functional support to the
cell nucleus.
http://www.sciencedirect.com/science/article/pii/S0022347672802294
Platypus
Biologists have long considered the platypus a fascinating creature, resembling a hodgepodge of
different animal parts. And in 2008, when researchers at Louisiana State University sequenced
the platypus genome, they discovered that its DNA is actually a mash-up of mammalian, avian, and
reptilian features. This discovery supports the idea that the platypus represents an ancient
branch on the mammalian tree.
Image: Wikimedia Commons/Stefan Kraft

Puffer Fish
Fugu, the poisonous puffer fish sought after by brave sushi-eaters, has the smallest known
vertebrate genome. When researchers unraveled its genetic structure in 2002, they found that
75 percent of its genes have direct human counterparts, even though the fish and humans diverged
from their common ancestor over 400 million years ago. By comparing human and fugu genomes,
researchers found almost 1,000 previously unidentified human genes.
Image: Flickr/Quinet

Some Useful Genomic Websites for Sequence Data and Bioinformatic Tools

Major sequence databases and tools:
NCBI (USA) http://www.ncbi.nlm.nih.gov/
UCSC Goldenpath Browser (USA) http://genome.ucsc.edu/
Broad Institute (USA) http://www.broadinstitute.org/
DOE Joint Genome Institute (USA) http://genome.jgi-psf.org/
Washington U. Genome Inst. (USA) http://genome.wustl.edu/
J. Craig Venter Institute (USA) http://www.jcvi.org/
Sanger (Ensembl) (UK) http://www.sanger.ac.uk/
European Bioinformatics Institute (UK) http://www.ebi.ac.uk/
Climb, GigaDb (China) http://climb.genomics.cn/
KEGG (Japan) http://www.genome.jp/kegg/

Bioinformatic software:
Unipro UGENE (HMMER, etc) http://ugene.unipro.ru/
MEGA 6 (phylogenetic analyses) http://www.megasoftware.net/
FASTA (download package) http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml
CLUSTAL Omega (EBI server) http://www.ebi.ac.uk/Tools/msa/clustalw2/
LOGOMAT-M & SKYLIGN (servers) http://www.sanger.ac.uk/resources/software/
Phylogenetic analysis (server) http://www.phylogeny.fr/version2_cgi/index.cgi
RAxML (ML phylogeny server) http://phylobench.vital-it.ch/raxml-bb/
VISTA tools for comparative genomics http://genome.lbl.gov/vista/index.shtml
World Tour of Genomic Resources http://www.openhelix.com/cgi/tutorialInfo.cgi?id=119

- Sequence search and comparison methodology, parameter values, homolog selection criteria.
- Alignment methods, parameters, and formats. Editing criteria.
- Build and explanation of pHMM, applications to sequence analysis.
- Information content of Skylign logo, specific for family studied.
- Phylogenetic analyses: 3 basic approaches (NJ, ML, Bayesian) and why.
- Phylogeny model & parameter selection, basic algorithms, bootstrap (or posterior probability), gamma rates and alpha parameter.
- Evaluation of trees: species order, outgroup, confidence values, branch lengths, phylogram vs cladogram.
- 8 (max 10) pages, 12 pt, 1.5 line spacing, exemplary figures, PDF or PPT format, by e-mail Fri 6-6-2014, work equally divided between participants.
- Questions and help for everyone: Monday 2 June 2014, 12:00-14:00 (e-mail)
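
The "gamma rates and alpha parameter" named in the report topics can be made concrete with pairwise protein distances. The sketch below is an illustration under stated assumptions, not the procedure of MEGA or RAxML: it computes the proportion of differing sites (p-distance), the Poisson-corrected distance d = -ln(1 - p), and the gamma-corrected distance d = a[(1 - p)^(-1/a) - 1] with shape parameter a (alpha). The function names and toy sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x not in "-." and y not in "-."]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def poisson_distance(p):
    """Poisson correction: assumes the same substitution rate at every site."""
    return -math.log(1.0 - p)


def gamma_distance(p, alpha):
    """Gamma correction: rates vary among sites with shape parameter alpha."""
    return alpha * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


if __name__ == "__main__":
    p = p_distance("MKVLAAGIVG", "MKILAAGLVG")  # invented toy sequences
    for alpha in (0.5, 1.0, 5.0):
        print(alpha, round(gamma_distance(p, alpha), 4))
    print("Poisson", round(poisson_distance(p), 4))
```

A small alpha means strong rate variation among sites and a larger correction; as alpha grows, the gamma distance approaches the Poisson distance.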
1. SEARCH for homologs:
a) BLASTP-NCBI-nr and pHMMER-UniProt results page (graphic, list of
homologs + statistics, and their alignments) exported to a text file.
b) Download all homologous sequences (in .FA and .GB formats).
c) Final (reasoned) selection of 150+ orthologs, plus some paralogs and an outgroup, well
distributed. Export to a .FA file.
2. Multiple sequence ALIGNMENTS:
NCBI-COBALT, EBI-CLUSTALO .ALN or .AFA, T->M-Coffee (comparison).
One file with the definitive alignment of the 150 sequences (.ALN), or one based on HMMALIGN or on the .HMM from JACKHMMER
(download the full-length HMM of all homologs).
Include a summary of the algorithm, its parameters and the formats in the report.
3. PROT.hmm model(s) built from the best alignment above.
Logo of the aligned sequences from SKYLIGN, with an explanation of its meaning (identify aa, motifs,
domains conserved for function).
Application of HMMSEARCH to rank the homologs relative to the pHMM.
4. Phylogenetic trees: 3 basic approaches (NJ, ML, Bayesian) and the programs chosen.
Presentation as a phylogram (not a cladogram), with numbers at the nodes and branch lengths drawn to scale.
Comparative evaluation of the aspects most important for interpretation:
bootstrap or posterior probability as confidence measures, gamma rates and alpha, branch lengths,
order of taxa (species), root + outgroup, uniform topology?
5. 3D PDB model from Chimera and/or ConSurf with identification of key sites.
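
The bootstrap named in step 4 as a confidence measure rests on resampling alignment columns with replacement. The following minimal sketch shows only that resampling step, on an invented toy alignment. Building a tree from each replicate and counting how often each clade recurs is left to the phylogeny program, and the function name `bootstrap_alignment` is hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of a {name: aligned_sequence} dict.

    Columns are drawn with replacement, and the same drawn columns are
    used for every sequence so that sites stay aligned.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


if __name__ == "__main__":
    toy = {  # invented sequences, for illustration only
        "human": "MKVLAAGIVG",
        "mouse": "MKVLSAGIVG",
        "fish": "MRILSAGLVG",
    }
    rng = random.Random(42)  # fixed seed so the replicates are reproducible
    for replicate in range(3):
        print(bootstrap_alignment(toy, rng))
```

Each replicate has the same length as the original alignment, so trees built from the replicates can be compared directly with the tree from the full data.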
