Phylogenetic Trees Made Easy
A How-To Manual
Second Edition

Barry G. Hall
University of Rochester, Emeritus

Sinauer Associates, Inc., Publishers
Sunderland, Massachusetts, USA

Phylogenetic Trees Made Easy: A How-To Manual, Second Edition
Copyright © 2004 by Sinauer Associates, Inc. All rights reserved.

For information address Sinauer Associates, Inc., 23 Plumtree Road, Sunderland, MA 01375 USA. FAX: 413-549-1118, orders@sinauer.com, publish@sinauer.com, www.sinauer.com

Downloadable files to be used with this text are available on the accompanying CD and at http://www.sinauer.com/hall/

Notice of Liability
Due precaution has been taken in the preparation of this book. However, information and instructions described herein are distributed on an “As Is” basis, without warranty. Neither the author nor Sinauer Associates, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused, directly or indirectly, by the instructions contained in this book or by the computer software and hardware products described.

Notice of Trademarks
Throughout this book trademark names have been used and depicted, including but not limited to Macintosh, Microsoft Windows, Microsoft Explorer, Netscape, and Adobe. In lieu of appending the trademark symbol to each occurrence, the author and publisher state that these trademarked product names are used in an editorial fashion, to the benefit of the trademark owners, and with no intent to infringe upon the trademarks.

Library of Congress Cataloging-in-Publication Data
Hall, Barry G., 1942-
Phylogenetic trees made easy : a how-to manual / Barry G. Hall. -- 2nd ed.
Includes bibliographical references and index.
ISBN 0-87893-312-3 (paperbound)
1. Phylogeny--Data processing. I. Title.
QH367.5.H27 2004
576.8'80285--dc22
2004010747

Acknowledgments

I am grateful to Joe Felsenstein for advice and help in learning to use PHYLIP, and to Jim Wilgenbusch for help in learning the command line interface of PAUP*. I am grateful to Dave Swofford and Jim Wilgenbusch of the PAUP* team, to Joe Felsenstein of PHYLIP, and to Rod Page of TreeView for agreeing to incorporate into future versions of their programs some changes that I thought would be helpful to the readers of this book. I thank David Fitch for perusing the manuscript for this book and for his many valuable suggestions for improvement. Errors that remain are mine, not his, and are probably attributable to my stubborn refusal to accept some of his suggestions.

I am grateful to Melinda and Bill for joyously sharing adventures that we would not otherwise have taken on.

I am especially grateful to my wife, Sue, for her patience and encouragement during the writing of this book. Much of my time that should have been spent on the process of moving into a new home was, in fact, spent writing.

Finally, I take deep pleasure in thanking my children, Steve, Scott and Rebecca, for their companionship and the pleasure of their company. This book is dedicated to them.

Table of Contents

Introduction: Read Me First
  A Brief Overview of the Second Edition
  Time-Limited Copy of the PAUP*4.0 Program
  Learn More about the Principles
  Computer Programs Discussed and Where to Obtain Them
  ClustalX
  TreeView
  PAUP*
  PHYLIP (PHYLogeny Inference Package)
  Tree-Puzzle
  MrBayes
  CodonAlign
  Other Programs
  Files and Utilities on the “Phylogenetics Made Easy” Website and CD
  Some Conventions Used in This Book
Chapter 1 Tutorial: Create a Tree!
  Why Create Phylogenetic Trees?
  Obtaining Related Sequences by a BLAST Search
  Step 1: Go to the BLAST Website
  Step 2: Use BLAST to Search for Sequences Related to Your Sequence
  Step 3: Decide Which Related Sequences to Include on Your Tree
  Downloading the Selected Sequences
  Creating the Multiple Alignment
  Creating the Input File
  Getting the Data into ClustalX
  Some General Comments about Creating Alignments
  Setting the Alignment Parameters
  Creating the Alignment
  Refining and Improving the Alignment
  Aligning New Sequences to an Existing Alignment, or Aligning Two Existing Alignments
  Phylogenetic Analysis
  Exactly What Is a Phylogenetic Tree?
  Methods for Constructing Phylogenies
  LEARN MORE ABOUT PHYLOGENETIC TREES
  Using ClustalX to Create a Neighbor Joining Tree
  Drawing the Tree Using TreeView
  Tree Formats: Different Appearances of the Same Tree
  Bootstrapping a Tree
  LEARN MORE ABOUT ESTIMATING THE RELIABILITY OF PHYLOGENETIC TREES
  Placing the Root of a Tree
  LEARN MORE ABOUT ROOTING PHYLOGENETIC TREES
  Printing and Saving the Tree
  Summary
Chapter 2 Basic Elements in Creating and Presenting Trees
  Selecting Homologs: What Sequences Can Be Put on a Single Tree?
  Fine-Tuning Alignments
  Major Methods for Creating Trees
  Which Method Should You Use?
  Distance versus Character-Based Methods
  LEARN MORE ABOUT TREE-SEARCHING METHODS
  LEARN MORE ABOUT DISTANCE METHODS
  Data Files Used to Illustrate Methods
  Using PAUP* to Create Trees
  Opening the Input File
  Creating Neighbor-Joining Trees Using PAUP*
  Creating Parsimony Trees Using PAUP*
  LEARN MORE ABOUT PARSIMONY
  Creating a Consensus Tree Using PAUP*
  Creating Maximum Likelihood DNA Trees Using PAUP*
  LEARN MORE ABOUT MAXIMUM LIKELIHOOD
  LEARN MORE ABOUT EVOLUTIONARY METHODS
  Creating Maximum-Likelihood Protein Trees Using Tree-Puzzle
  Creating Bayesian Trees Using MrBayes
  Creating the Execution File
  LEARN MORE ABOUT BAYESIAN ANALYSIS
  What the Statements in the Example MrBayes Block Do
  Interpreting MrBayes Results
  Sample Blocks for MrBayes
  Getting Help
  Presenting and Printing Your Trees
  Opening Tree Files in PAUP*
  To Root or Not to Root?
  Choosing What Form of a Tree to Publish
  Making a Tree Pretty: Not Just a Cosmetic Matter
  DNA Phylogeny or Protein Phylogeny: Which Is Better?
  Using CodonAlign 2.0
Chapter 3 Advanced Elements in Constructing Trees
  Reconstructing Ancestral DNA Sequences
  Ancestral Sequences for Parsimony and Maximum Likelihood Trees Using PAUP*
  Ancestral Sequences for Parsimony using PHYLIP
  Using Protein Structure Information to Construct Very Deep Phylogenies
  Analyzing Trees for Evidence of Adaptive Evolution by Detecting Positive Selection in a Phylogeny
Chapter 4 Using Alternative Software to Construct and Present Trees
  Using PAUP* with Windows or Unix
  Opening and Viewing Files
  Using PAUP Blocks
  Bootstrapping
  Using PHYLIP
  General Features of PHYLIP Programs
  Parsimony Trees
  Consensus Trees
  Neighbor-Joining Trees
  Maximum Likelihood Trees
  Bootstrapping
Appendix I File Formats and Their Interconversion
  Formats Used by Programs Discussed in this Book
  The FASTA Format
  The Clustal Format
  The Nexus Format
  PHYLIP 3.x
  Other File Formats
  The GCG/MSF Format and PileUp
  The NBRF/PIR format
  Interconverting Formats Using PAUP*
  Importing Various File Formats into PAUP*
  Exporting Various File Formats from PAUP*
  Interconverting Interleaved and Sequential Formatted PHYLIP Files
Appendix II Printing Alignments
  Printing to Assess the Quality of the Alignment
  Printing Alignments for Publication
Literature Cited
Index to Major Programs Discussed
Subject Index

Introduction: Read Me First

This is a “cookbook” intended as a tool to aid beginners in creating phylogenetic trees from protein or nucleic acid sequence data. It assumes basic familiarity with personal computers and with accessing the World Wide Web using browsers such as Netscape or Microsoft Explorer. I have not attempted to explore all the alternative approaches that might be used, intending only to give the beginner an approach that will work well most of the time and is easy to carry out. I hope the book will also serve the investigator who has a modest familiarity with phylogenetic tree construction but needs to address some aspects and problem areas in more depth.

Whatever the user’s level of experience, this book devotes a significant amount of attention to the problem of aligning proteins and nucleic acids. Although one of the major purposes of alignments is to create phylogenetic trees, it is far from the only important application of alignments. Studying alignments of a series of related proteins can provide insight into possible active sites by identifying those regions that are most strongly conserved and that contain probable catalytic residues. Thus those investigators whose interests are more in protein function than in phylogeny per se are also the intended audience for this book.

This book is not intended to be used as a primary text in a systematics or phylogenetics course, and it is not appropriate for that purpose.
It can, however, be used as a supplement to the primary text and can serve as a tool for making the transition between a theoretical understanding of phylogenetics and a practical application of the methodology.

A Brief Overview of the Second Edition

As pointed out in the first edition, software changes quickly, and some screens may well look different from those depicted in this book. Indeed, within a few weeks of publication of the first edition, the appearance of the Web-based BLAST screens was substantially different. It is not only cosmetic appearances that change. Programs change the names of input parameters, what is contained within various output files, etc. Such changes are intended to simplify things for users and are often part and parcel of improving a program's functionality, but they do make it awkward for a beginner trying to follow a book in order to learn both a method and the software to implement that method. One of the major motives for this edition was to make the book current with respect to the software as of the end of 2003.

When a simple repair of a shower in my home revealed serious and extensive dry rot, my contractor said, “You know, you don’t have to put this bathroom back exactly the way it was. If you are ever going to remodel it, this is the time.” That turned out to be good advice. In the same sense, I decided that if I was ever going to reorganize this book to make it more useful, this is the time.

One of the critics who reviewed the first edition was kind enough to take the first sentence of the book seriously and compared it to a well-written cookbook. To take that analogy a bit farther, this edition is organized as one might organize a cooking class for someone who had never prepared food (indeed, had never even seen a kitchen) and assumed that food simply appeared by magic.
One might begin such a class with an entire evening devoted to boiling an egg, beginning with purchasing the egg, boiling water, putting the egg into the water for a specific duration, cracking the egg, and finally, presenting the egg at a table. It’s not a lot, but it allows the student to experience the basic elements of cooking and results in something edible at the end of the class. That is exactly the purpose of Chapter 1, “Tutorial: Create a Tree!”. The reader learns one way to create a tree that is a valid representation of the historical relationships among a set of related sequences. It is not enough for a meal, but it is a start.

The next chapter, “Basic Elements in Creating and Presenting Trees,” might be compared with demonstrating sautéing, baking, broiling, and roasting. Each is an important method and it is necessary to know something about all of them if one is to prepare any but the most limited meals. Chapter 2 explains how to fine-tune alignments (the raw data from which phylogenies are constructed) and then presents each of the major methods for constructing phylogenies in some detail, with examples. The chapter finally turns to methods of presenting trees in publications and how to troubleshoot some common problems. Chapter 2 is the meat of this book. Understanding Chapter 2 will permit confident construction of phylogenetic trees under circumstances that anyone but an expert systematist is likely to encounter.

Throughout Chapter 2 you will find notes such as PAUP* for Windows/Unix: pages xxx-yyy and PHYLIP: pages xxx-yyy. These notes guide you to the pages in Chapter 4 where the same method is discussed.

Chapter 3 presents some advanced topics for those who may want to go beyond the basics. It is not necessary to understand, or even to read, Chapter 3 in order to construct sound, valid phylogenies. Chapter 3 is like learning to make soufflés or spectacular flaming desserts. It is not necessary, but it can sure add a nice touch.
The advanced topics in Chapter 3 include reconstructing ancestral states (i.e., estimating the sequences of ancestral genes); analyzing trees for evidence of adaptive evolution; and constructing very deep phylogenies from protein crystal structure data.

Chapter 4 recognizes that not everyone will want to invest in the purchase of PAUP*, the primary software package discussed here for tree construction. It is also the case that PAUP* for Windows does not use the same interface as PAUP* for Macintosh computers, from which the examples in this book are drawn. Chapter 4 therefore covers the details of tree construction using PAUP* for Windows and using the free and widely used PHYLIP program. The only cooking analogy for this chapter is that it is like low-calorie cooking: not as much fun, but sometimes necessary.

Time-Limited Version of the PAUP*4.0 Program

The CD accompanying this book is compatible with both Windows and Macintosh computer platforms. It includes a fully functional but time-limited version of PAUP*4.0 beta 10 for each platform. Users can install PAUP* onto one computer, and it will be functional for a period of six months from the date of installation.

The purpose of including this time-limited PAUP* is to permit students to become familiar with the program, especially in the context of a course for which this book may be required. Those students who wish to continue using PAUP* for phylogenetic analysis after that time are encouraged to order PAUP* from Sinauer Associates (http://www.sinauer.com). An advantage of purchasing PAUP* is that a registered purchaser is entitled to free updates of all beta releases, and to the final package release (which is expected to include a user's manual). Registered purchasers are also notified by email of any issues that may arise with respect to the use and/or performance of PAUP*.

Learn More about the Principles

Just as it is possible to implement molecular methods without understanding them by following the protocols in commercial “kits,” it is also possible to implement phylogenetic methods without understanding them by following the protocols in this book. Most of us insist that our students understand the principles underlying the methods implemented by these kits, because we know that without such an understanding it is impossible to spot and troubleshoot many problems.

It is in this spirit that the reader will find “Learn More” boxes scattered throughout the text. These boxes present somewhat more detailed background on the various methods and suggest further reading. It is not necessary to read the boxes to be able to construct a reliable, valid phylogenetic tree, but understanding the principles outlined will help troubleshoot the phylogenetic problems that arise when creating trees from molecular alignment data.

Readers who want to go beyond the Learn More boxes will find Dan Graur and Wen-Hsiung Li's Fundamentals of Molecular Evolution (Graur and Li 2000) very helpful and enjoyable to read. Li's Molecular Evolution (Li 1997) and Chapters 11 (Swofford et al. 1996) and 12 (Hillis et al. 1996) of Molecular Systematics (David Hillis, Craig Moritz, and Barbara Mable, eds.) provide more detailed insights into these topics.

Computer Programs Discussed and Where to Obtain Them

All of the programs described in this book are available for both Macintosh and Windows platforms, and most are available for Unix as well. The examples in this book use the Macintosh versions of software that were current in November of 2003. The Windows and Unix versions of the programs may vary slightly in detail, but not so much as to preclude using this book as a guide for those platforms.
The Windows versions of these programs have been tested by the author, and where they differ significantly from the described Macintosh programs the differences are noted in the text. Software authors are continuously upgrading (and sometimes even improving) their software, so some menus, windows, and dialogs may differ from the examples shown. If you don’t see a feature that is mentioned here, look around a bit. It may be in a different menu or have a slightly different name. Users should ALWAYS register these programs when registration is available. Registration often allows software authors to alert users of updates and new features.

If you have access to a Macintosh computer I strongly recommend that you use that machine for phylogenetic analyses and that you obtain PAUP* for Macintosh. Modern Macs, especially the new G5 computers, are as fast as or faster than the fastest Windows machines. The Macintosh implementation of PAUP* is superior to the Windows/Unix implementation because it includes a high-resolution tree-drawing program that allows you to present your trees easily and conveniently. If you can possibly do so, buy PAUP*. It is considerably more convenient to use than the free PHYLIP.

ClustalX

ClustalX (Thompson et al. 1997) is the primary multiple alignment program. Files created by ClustalX can be used by other programs to display and print trees, as well as to display alignments in ways that facilitate recognizing regions of high similarity. ClustalX for Macintosh, Windows, Linux, and some Unix machines is available free over the Web at

ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/

Help for ClustalX is available at

www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html

Examples in this book were created using ClustalX version 1.81. If you have an earlier version of ClustalX, or if you are still using ClustalW, you will find it useful to download the most recent version of Clustal from one of the above sites.
If your version is more recent than 1.81, the appearance of some screens may differ slightly from the illustrations.

TreeView

TreeView is a free program for drawing phylogenetic trees. It does not create those trees; it simply uses files created by phylogeny programs to display and print the trees. TreeView allows you to modify the appearance of a tree to suit your taste and needs. TreeView for Macintosh, Windows, Unix, and Linux is available at

http://taxonomy.zoology.gla.ac.uk/rod/treeview.html

TreeView will soon be replaced by TreeView X, currently under development. The incomplete development version is currently available from

http://darwin.zoology.gla.ac.uk/~rpage/treeviewx/

TreeView X promises some interesting new features, so it is worth checking on the progress of this major revision.

PAUP*

PAUP* 4.0 (Swofford 2000) is the primary tree-building program discussed in this book. PAUP* uses the files created by ClustalX to build trees by any one of several methods. PAUP*4.0 beta is available for Macintosh, Windows, and Linux/Unix. PAUP* is an inexpensive commercial program available from Sinauer Associates, Sunderland, MA. Information about ordering PAUP* 4.0 is available at

http://www.sinauer.com/detail.php?id=8060

and at the PAUP* home page,

http://paup.csit.fsu.edu/

Wherever you order PAUP*, be sure to go to the PAUP* home page and download both the Command Reference Document and the Quick Start Tutorial.

Although it is a superb program, PAUP*4.0 is not yet in its final form. The current release is PAUP*4.0b10, meaning beta version ten. It is not clear that PAUP*4.0 will ever be finalized or that a manual for PAUP* will ever be written, but each beta release represents a significant improvement over the previous beta. In fact, each beta release corresponds to a minor new version. If you cannot be comfortable without a manual for PAUP*, use PHYLIP to construct your trees and TreeView to draw them.
Despite the absence of a manual, I recommend PAUP* and I strongly recommend using the Macintosh version if a Macintosh computer is available to you. The Macintosh version (but not the Windows or Unix versions) includes an excellent interface for drawing and printing trees.

PHYLIP (PHYLogeny Inference Package)

This is a package of programs for inferring phylogenies (evolutionary trees). It is available free at

http://evolution.genetics.washington.edu/phylip.html

PHYLIP is written to work on as many different kinds of computer systems as possible. The source code is distributed (in C), and for some operating systems executables are also distributed. In particular, already-compiled executables are available for Windows 95/98/NT, Windows 3.x, DOS, and Macintosh systems. Complete documentation is available in documentation files that come with the package. The PHYLIP source code is available to be compiled on any Unix machine.

Tree-Puzzle

Tree-Puzzle is a program for constructing Maximum Likelihood (ML) trees from DNA and protein sequences. It is available free of charge from

http://www.tree-puzzle.de/ (Tree-Puzzle home page)
http://iubio.bio.indiana.edu/soft/molbio/evolve (IUBio archive www, USA)
ftp://iubio.bio.indiana.edu/molbio/evolve (IUBio archive ftp, USA)
ftp://ftp.pasteur.fr/pub/GenSoft (Institut Pasteur, France)

MrBayes

MrBayes (Huelsenbeck and Ronquist 2001) is a program for constructing phylogenetic trees by the Bayesian method. MrBayes is available from

www.mrbayes.net

The currently available version (November, 2003) is a beta version, MrBayes 3.0b5, but the final MrBayes 3.0 should be released by the time you read this. Please note that some commands have changed from MrBayes 2.0, so readers interested in using the examples in this book should update to MrBayes 3.0.

CodonAlign

CodonAlign is a simple program that creates a DNA alignment based on alignments of the corresponding proteins.
It introduces into each DNA sequence a triplet gap at the position of each gap in the aligned protein sequence. CodonAlign is available from this book's companion website,

http://sinauer.com/hall/

The new version, CodonAlign 2.0, is somewhat easier to use than the original. Users of the earlier version should be sure to read the new documentation for input file format.

Other Programs

There are many phylogenetics programs that I have not mentioned here, some of them widely distributed. My failure to mention a program does not imply that the program is not useful or valuable. I have chosen to include a minimal set of programs that will allow you to implement the methods I discuss in this book, and I have tried to include only programs that are available for Macintosh, Windows, and Unix platforms. See

http://evolution.genetics.washington.edu/phylip/software.html

for a more extensive list of phylogenetics programs.

Files and Utilities on the “Phylogenetics Made Easy” Website and CD

The “Phylogenetics Made Easy” (PME) website was created especially for this book. Located at

http://www.sinauer.com/hall/

the site contains a variety of files that will make it easier for you to follow the tutorial without actually downloading the sequences to create an input file for ClustalX. It also includes copies of all of the relevant output files so you can compare your results with those I obtained in the event that you have difficulties.

In addition, the CD that comes with this book includes all of the files that are on the website at the time of publication. Instead of downloading, you can simply copy those files from the CD.

The site and CD also include templates, or boilerplate, for various input files. You can copy those templates directly into your own input files in lieu of typing everything manually. It is amazingly easy to make a tiny typing error, such as “1st _pos” instead of “1st pos”, and spend hours trying to figure out why your program won't run.
Copying blocks of critical text, then modifying those blocks for your specific needs, is a good way to avoid such problems. Indeed, many of us use such templates routinely.

Throughout the text these files and templates are indicated with an icon. For example,

Chapter 1 files:Clustal#1:Metallo.aln

means on the website in the Chapter 1 files folder, in the Clustal#1 folder, the Metallo.aln file.

The site and CD also include a copy of the program CodonAlign 2.0, mentioned earlier. They do not include copies of ClustalX, TreeView, or MrBayes. Although these programs are distributed at no charge, it is much better for you to download the most recent versions from the websites listed above than to work with the versions that were current at the time this book was assembled.

Some Conventions Used in This Book

• Click, as in “click the OK button,” means to use the mouse to position the cursor over the indicated button on the screen and to depress and quickly release the mouse button. Double Click means to click twice rapidly, without moving the mouse.
• Drag means to position the mouse and, while holding down the mouse button, move the mouse to another position.
• Select means to highlight a section of text or an object on the screen by dragging across the indicated region or by double clicking on the object.
• In the text, the Chicago font indicates a menu item or a button that you will see on the screen.
• For command-line programs such as PHYLIP, MrBayes and CodonAlign, the Courier font indicates text that you will see on the screen or that you will type into an input file.

Chapter 1
Tutorial: Create a Tree!

Why Create Phylogenetic Trees?

Today phylogenetic trees appear frequently in molecular papers that are unrelated to phylogenetics or to evolution per se. Their inclusion reflects the growing recognition of trees as a tool for understanding biological processes.
Phylogenetic trees allow you to organize your thinking about a protein of interest in terms of its relationship to other proteins, and may allow you to draw conclusions about its biological functions that would not otherwise be apparent.

As genomes are being sequenced at rapid rates, our knowledge about DNA and protein sequences is far outstripping our knowledge about biological and biochemical functions. As a result, we are frequently forced to assign biological functions to proteins on the basis of sequence homology alone. As the sequence databases grow larger, more and more assignments are based on homology to sequences whose functions have been assigned only tentatively, based on homology to still other sequences. Examination of a phylogeny can allow you to determine just how closely or how distantly your sequence relates to a sequence whose function is actually known from biological or biochemical information.

For a long time, it was traditional to scan sequence databases for related sequences, then to publish a table giving pairwise homologies expressed as percent identities or percent similarities. As databases grew, it became impossible to present tables of all the homologs, so we started to create multiple alignments with programs such as Clustal and PileUp. The process of creating a multiple alignment begins with computing all pairwise alignments, then making a rough “guide tree” from those pairwise comparisons. Guide trees are often published by molecular biologists as phylogenetic trees. Because guide trees are based only on pairwise comparisons rather than on an overall multiple alignment, the trees are not based on evaluation of sites that are homologous across all sequences, and thus they often contain significant errors and can lead to incorrect interpretations of data. Valid phylogenetic trees are created in order to avoid those errors of interpretation.
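
The traditional table of pairwise percent identities mentioned above can be sketched in a few lines of Python. This is only an illustration, not a method taken from this book: the function names and the toy sequences are hypothetical, the sequences are assumed to be already aligned to equal length with "-" marking gaps, and skipping any column in which either sequence has a gap is just one of several conventions in use.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Columns where either sequence has a gap ('-') are skipped. This is an
    assumed convention; other definitions divide by the full alignment length.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # ignore gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_table(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Percent identity for every unordered pair of named sequences."""
    return {
        (n1, n2): percent_identity(aligned[n1], aligned[n2])
        for n1, n2 in combinations(aligned, 2)
    }


# Made-up aligned fragments for illustration only (not real proteins).
toy = {
    "seqA": "MKT-LLVA",
    "seqB": "MKTALLVG",
    "seqC": "MRT-LIVA",
}

for (n1, n2), pid in identity_table(toy).items():
    print(f"{n1} vs {n2}: {pid:.1f}% identity")
```

Each entry in such a table is a single number summarizing one pair in isolation. It says nothing about which sites are homologous across all of the sequences, which is the limitation the text attributes to guide trees built from pairwise comparisons.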

As covered in this book, the steps involved in creating a valid phylogenetic tree from molecular sequence data are

1. Identify a protein or DNA sequence of interest.
2. Identify other sequences that are related to the sequence of interest and obtain electronic files of those sequences.
3. Align the sequences.
4. Using the resulting alignment, generate a phylogenetic tree.
5. Print (and perhaps publish) the results.

If you are reading this, you have probably already finished step 1. The remaining steps require a computer that is connected to the Internet and a set of suitable programs. This manual will guide you through these steps, and for each step I will suggest suitable programs and provide information on obtaining them. For steps that can best be accomplished over the World Wide Web, I will provide website addresses. (The reader should of course be aware that such addresses can change and may no longer be valid when you read this manual.)

The tutorial involves constructing a tree from protein sequence data. I chose to use a protein example (a metallo-β-lactamase) instead of a DNA example, because proteins involve a complexity that does not apply to nucleic acid sequences: the alignment of amino acids that are similar but not identical to each other. The reader who wants to use DNA sequence data can still follow the steps in Chapter 1.

Obtaining Related Sequences by a BLAST Search

Even if you are already familiar with BLAST searches (Altschul et al. 1990; Altschul et al. 1997) and downloading sequences of interest, you may want to read the discussion in Step 3 on how to decide which sequences to download. This tutorial will have the most value if you use your computer to follow along at every step.
In several places the descriptions assume that you are doing so and therefore do not include screen shots of every detai Usually you already have a particular protein or nucleic acid sequence that ‘you are interested in and you need to find other sequences that are related to it. By “related,” we mean that another sequence is sufficiently similar to the Tutorial: Create a Tree! sequence of interest that we believe the two sequences share a common ances- tor (ie, they are related by descent) ‘The easiest way to find related sequences is to use a program that will search. the international nucleic acid and protein databases for similar sequences. For- ‘tunately, you do not even need to own such a program; you can do the entire search of the World Wide Web courtesy of United States government com- puters, The search-and-download program we will use in this tutorial is called BLAST. BLAST uses your sequence as a “query” to search the world’s com- ined databases of protein and nucleic acid sequences I will assume that your sequence exists as some sort of electronic file— pethaps a simple text file, or perhaps as a file from a sequence manipula- tion program. Almost any format will do. The example I use here is the L1 metallo-f-lactamase protein sequence. Step 1: Go to the BLAST Website Using either Netscape or Microsoft Explorer, go to http//www.ncbi.nim.nih.gov/BLAST/ (Not all browsers handle events the same way, so I do not recommend other browsers.) The window will look like Figure 1-1 Figure 1.1 W 12 Chapter 1 ‘The window is divided into several major sections: + Nucleotide, used when your sequence is DNA or RNA. * Protein, used when your sequence is a protein. «Translated, which Icts you search the nucleic acid databases with a protein sequence or the protein databases with a nucleic acid sequence. + Genomes, which allows you to search within the complete or partially sequenced genomes of a variety of organisms. 
• Special, which includes Align two sequences (bl2seq). That function, instead of searching the databases, aligns two sequences using the BLAST algorithm. This is a very useful tool for determining whether two sequences are sufficiently related to be homologous to each other.
• Meta, which allows you to retrieve the results of an earlier search by using the request identification number (RID).

Because we are going to search for similar protein sequences, we will click the Protein-protein BLAST [blastp] link under Protein to see the screen in Figure 1.2 (but with the Search box empty).

Step 2: Use BLAST to Search for Sequences Related to Your Sequence

Open the electronic file containing your sequence of interest and copy the sequence. Return to the BLAST page in your browser program, click in the text box next to where it says Search, and paste your sequence into that box (Figure 1.2). BLAST refers to that sequence as the query sequence. To speed the process up a bit you can uncheck the box that says Do CD Search (arrow), and also check the box in the lower portion of the screen that says Mask for lookup table only.

If you are reading this as a tutorial and would like to follow along exactly with the text, you can use an alternative method to enter the sequence of interest. Simply type the accession number for the sequence of interest into the text box instead of pasting in the sequence. The accession number for the L1 metallo-β-lactamase is P52700.

Now click the BLAST! button to submit your sequence to BLAST. You will see the screen shown in Figure 1.3.

The screen gives you two important items of information: (1) the Request ID; and (2) how long you can expect to wait for the results.
You should copy or write down the Request ID in case you need to go back to this search later. To return to a search, scroll down to the bottom of the BLAST Home page (Figure 1.1) and click the link under Retrieve results for an existing Request ID. The resulting screen will look like Figure 1.3, except that you have to enter the Request ID in the box at the top. Click the Format! button to return to your search. (BLAST will retain the search results for a couple of days, but not forever.)

In the lower section of the screen, you can choose the number of descriptions and alignments that will be returned. So that you can see alignments for each of the sequences that are returned, set the number of alignments to 100 to match the number of descriptions (Figure 1.3, arrow). Once the estimated time has passed, click the Format! button. The time estimate is often optimistic, but if you click too early BLAST will simply tell you to wait awhile.

Step 3: Decide Which Related Sequences to Include on Your Tree

Eventually you will get back a response so that the page now looks like Figure 1.4.

This screen tells you that the program search included the non-redundant GenBank coding sequence translation, the PDB database of protein structures, the SwissProt database, etc., and that almost 1.5 million sequences were compared with the query sequence. There is a graphical representation of the aligned sequences that shows that many of the high-scoring sequences were about the same length as the query sequence, but that the low-scoring sequences were typically much shorter.

One hundred related sequences were found. More might have been found, but we limited the number to 100 (Figure 1.3); we can go back and change that limit if we wish. Scroll down until you see the list of related sequences that looks something like Figure 1.5.
At the left of each sequence is the name of the file, consisting of the gi number and the accession number. That file is in one of the databases that has been searched. The gi number is a unique identifier that can be used to locate a file, whatever the database it resides in. That name, which is underlined, is a link, and clicking it will take you directly to the file. In the middle is a brief description of the file.

Next is the bit score. Calculating the bit score is complex, but all you really need to know is that the higher the bit score, the more closely related that sequence is to the query sequence, i.e., to your sequence of interest. The bit score is also a link that will take you farther down the page to a display of your query sequence aligned to that sequence.

At the far right is a number, the E value, which is the number of such matches to the current non-redundant sequence database that are expected by chance alone. The smaller the E value, the more likely that the similarity is "real" (that is, that the similarity reflects common descent and is not the result of sheer chance).

The first file in the list in this case is the query sequence itself, because that sequence was already in the databases. Notice that as you read down the list the E values increase. Scroll down to the end of the list. The last sequence in the list is a "conserved hypothetical protein." The problem you face is choosing which sequences you want to keep in your list of related sequences. This is a critical choice, because that list will be the set of sequences from which you construct your phylogenetic tree.

In constructing a phylogeny it is essential that we only include sequences that are truly homologous (i.e., that are descended from a common ancestral sequence) on a given tree. As a rule of thumb, we can be confident that sequences with E values ≤ 10^-5 are homologs of the query sequence.
With that in mind, you can eliminate from consideration the 62 sequences below the sequence

gi|11498163|ref|NP_069409.1| conserved hypothetical protein...   52   1e-05

Scrolling back to the top of the list (Figure 1.5), notice that the second in the list has an S to the right of the E value. That S indicates that this is a protein structure file of the same query sequence, so you can eliminate that one also. At this point it is useful to scroll down past the list of sequences to the alignments themselves (Figure 1.6).

You will use these alignments to decide which sequences to include in the phylogeny. While you are doing so, you will also select those sequences for later downloading.

You certainly want to include the query sequence itself, so tick the box indicated by the arrow. Skipping the identical sequence, note that the next two sequences are also called "L1" and are from the same species, Stenotrophomonas maltophilia. These sequences are 92% and 88% identical to the query sequence, respectively. If I wanted an exhaustive list of every homologous sequence I would include them, but because they represent only within-species variation, I'll just choose one of them. [At this point I assume that you are following this exercise on your computer and that you can actually scroll down to see what I am discussing.] I will also include

>gi|6977948|emb|CAB75346.1| metallo beta-lactamase [Stenotrophomonas maltophilia]

The next sequence, however, is shorter and will not be included (Figure 1.7). For that sequence the first amino acid aligns with residue 23 of the query sequence. It and the next four sequences are simply mature L1 in which the leader peptide is not given. They will not be included.
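The two screening criteria used so far (a small E value, and an alignment that spans most of the query) can be written down as a first-pass filter. The sketch below is only an illustration: the Hit record, the select_hits function, the example hits, and the cutoff values are all hypothetical, and BLAST itself provides nothing of this form. Such a filter only narrows the list; which sequences finally go on the tree remains a matter of judgment.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit, reduced to the fields used for screening (hypothetical record)."""
    name: str         # identifier of the database sequence
    e_value: float    # E value reported for the hit
    query_start: int  # first query residue covered by the alignment (1-based)
    query_end: int    # last query residue covered by the alignment


def select_hits(hits, query_length, max_e, min_coverage):
    """Keep hits whose E value is small enough and that span enough of the query."""
    kept = []
    for hit in hits:
        coverage = (hit.query_end - hit.query_start + 1) / query_length
        if hit.e_value <= max_e and coverage >= min_coverage:
            kept.append(hit)
    return kept


# Made-up hits against a 290-residue query (all values are illustrative).
hits = [
    Hit("full_length_homolog", 1e-120, 1, 290),
    Hit("mature_form_no_leader", 1e-110, 23, 290),
    Hit("weak_partial_match", 2e-3, 100, 180),
]

for hit in select_hits(hits, query_length=290, max_e=1e-5, min_coverage=0.95):
    print(hit.name)
```

With these assumed cutoffs the hit that starts at residue 23 is dropped because it covers only about 92% of the query, which mirrors the decision to leave out the mature forms that lack the leader peptide.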
Continuing down the alignments in this fashion, select the sequences to be included, in each case ticking the selection box at the left. I've chosen to include only one of the GOB variants, but you might choose to include several of these if you want a more exhaustive phylogeny.

When I reach

>gi|20090857|ref|NP_626932.1|

I notice that the sequence alignments have become short and the E values have risen (Figure 1.8). Since for my purposes I am only interested in sequences that align over most of the length of the query, I will exclude these sequences and all sequences below them on the list. I emphasize "for my purposes" because these decisions depend entirely on the purpose of your phylogeny. You might well want to go much farther down the list if you are interested in related proteins that have greatly diverged from your protein of interest. The decisions about which sequences to keep and which to eliminate cannot be reduced to an algorithm; they depend on what you intend to accomplish with your phylogenetic tree.

Do you want as complete a tree as possible? In that case, you will keep everything that is probably a true homolog and is not identical to another sequence. (Note that more than one investigator may submit the same sequence to GenBank, resulting in duplicate entries.) If you only want to show representatives of the major groups, you will be much more selective.

Downloading the Selected Sequences

The BLAST Report Web page provides a convenient means of downloading to your own computer the sequences you have just identified. Scroll to the beginning of the alignments and click the Get Selected Sequences button (Figure 1.9, arrow).

You will now see the screen in Figure 1.10.
Ten sequences were selected (scroll down to see all of them). Tick the box at the left of each sequence (Figure 1.10, arrow) to select them all. Next, change the Display choice from Summary to FASTA and the Send To choice so that it reads File. Click the Display button to see Figure 1.11.

FASTA is the file format that you will use in the next step to create an alignment. Again, be sure to tick each of the selection boxes, then click the Send To button. In the resulting dialog, name the file something like MetalloBla.fasta. You can compare the FASTA file you just saved with the one I created by downloading MetalloBla.fasta from the Chapter 1 files.

Chapter 1 Files: MetalloBla.fasta

Now change the Display to GenPept and click the Display button to see Figure 1.12. GenPept files provide a wealth of information about each sequence, information that you may well need later when you write your paper. You will again need to scroll down to select each file, then click the Send to (File) button. In the resulting dialog change its name to something like MetalloBla.GenPept and save it.

The files you just saved are text (ASCII) files that can be opened in any word processor.
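Because a FASTA file is plain text, it is easy to check that a download contains the number of records you expect. The following is a minimal sketch rather than part of any program discussed in this book; the function name read_fasta and the two made-up records are assumptions for illustration. It relies only on the FASTA convention that each record starts with a line beginning with ">".

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two made-up records, used only to show the layout of a FASTA file.
example = """>seqA first made-up record
MKTLLA
FALAV
>seqB second made-up record
MRSTL
"""

records = read_fasta(example)
print(len(records), "records")
for header, sequence in records:
    # The first whitespace-delimited word of the header serves as the name.
    print(header.split()[0], len(sequence))
```

To apply the same check to the file saved above, read its contents first, for example with open("MetalloBla.fasta").read(), and pass the resulting text to read_fasta.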
Creating the Multiple Alignment

If you are familiar with ClustalW or earlier versions of Clustal (Higgins and Sharp 1988; Thompson et al. 1997; Thompson et al. 1994), you should probably skim this section to become aware of the new "windows" style interface offered by ClustalX. You should also look at Chapter 2 to see some of the new capabilities of ClustalX that were not available in ClustalW.

A pair of sequences can be aligned by writing one sequence above the other in such a way as to maximize the number of residues (nucleotides or amino acids) that match by introducing gaps (spaces) into one or the other sequence. Biologically, those gaps are assumed to represent insertions or deletions that occurred as the sequences diverged from a common ancestor.

If we could insert as many gaps as we choose, we could align any two random, unrelated sequences so that all residues either matched perfectly or were across from a gap in the other sequence. Such an alignment would be meaningless, however. It is necessary to somehow constrain the number of gaps so that the resulting alignment makes biological sense. To do that, a scoring system is used so that matching residues get some sort of positive numerical score, and gaps get some sort of negative score, or gap penalty. An alignment program seeks an arrangement that maximizes the net score.

For nucleic acid alignments, matching residues usually get a score of 1 and mismatches get a score of 0. For protein sequences, scoring is more complicated because mismatches between biochemically similar amino acids usually get an intermediate score. Those scores are usually determined by the alignment program itself. Details of those scoring methods will be discussed in Chapter 2 in the section on "Fine-Tuning Alignments" (page 64).
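To make the idea of "maximizing the net score" concrete, here is a small sketch of the standard dynamic-programming approach to global pairwise alignment (the Needleman-Wunsch method), using the nucleic acid scores mentioned above: 1 for a match and 0 for a mismatch. The fixed penalty of 1 per gapped position is an assumption chosen for illustration, and this is not the code that ClustalX runs. The function returns only the best score, not the alignment itself.

```python
def best_alignment_score(a, b, match=1.0, mismatch=0.0, gap=-1.0):
    """Best global alignment score of sequences a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap penalty per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] opposite a gap
                score[i][j - 1] + gap,       # b[j-1] opposite a gap
            )
    return score[-1][-1]


print(best_alignment_score("ACGT", "ACGT"))            # four matches
print(best_alignment_score("ACGT", "AGT"))             # three matches, one gap
print(best_alignment_score("ACGT", "AGT", gap=-5.0))   # same gap, penalized more heavily
```

Raising the gap penalty lowers the score of any arrangement that needs gaps, which is how the penalty keeps a program from scattering gaps freely to manufacture matches.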
Gap penalties, on the other hand, are typically set by the user, and typically there is a penalty for creating a gap plus an extra penalty for the length of the gap. Details are covered later in this chapter, in the section on "Changing Gap Penalties" (page 34).

Aligning a pair of sequences is not a computationally difficult process, and a variety of programs exist to align sequence pairs. Multiple alignments are considerably more complex, and only a few programs do a really good job. The ClustalX program is one of the best tools for creating multiple alignments.

ClustalX is an updated version of ClustalW, an old-fashioned "menu-driven" program. In terms of what it does, ClustalX is virtually identical to ClustalW, but ClustalX has a windows environment that will be familiar to Macintosh, PC, and Unix users alike. To better understand what ClustalX does, pull down the Help menu in ClustalX and read the various entries.

For more details on ClustalX, you can go to the online ClustalX help file on the Web at: www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html

Creating the Input File

ClustalX, like any other computer program, requires that the data it manipulates (the input file) be in a format that it can recognize. You can use your favorite word processor to create the input file. The input file must contain each of the sequences that are to be aligned. In our example use the MetalloBla.fasta file that you saved from the BLAST search.

For convenience, you will edit that file. Just to be safe, though, first make a copy of the MetalloBla.fasta file and name it something like Metallo.fasta. It is always a good idea to work on copies rather than originals of important files.

ClustalX will recognize several formats for the sequences, but we will use the FASTA format (see Appendix I) because we downloaded sequences in that format. The FASTA format can be recognized because the first line begins with the ">" character.
That character is followed by a single word that ClustalX will use as the name for the sequence in the multiple alignment that it creates. Open the file using your favorite word processing program. The first sequence in the file looks like

>gi|1705478|sp|P52700|BLA1_XANMA Metallo-beta-lactamase L1 precursor (Beta-lactamase, type II) (Penicillinase)
NRSTLLAFALAVAL PRAHTSRAEVELPOLRAYTVIA.SWLOEVAPLOTADHTWQIGTEDLIALLVQTEDGA
“VLLDGMPQMASHLLDNMKARGV TPRDLELTLLSHAIADHAGP VAELKRECTGAKVBANABSAVLLBRGSS
DULSERGDGETYPPANADRE VODGEVT VGGIVETALIENAGHITPGSTAHIWTDTRNGKPVRIAVADSLSAP
(G¥QLGNPRY PHL IEDYRRSPATVRAL.PCOVLLTPHEGASNNDYBAGARNGAKALICKAYADBABQIEDS
(CLAKETAGAR

Clustal treats everything between ">" and the first space as the sequence name. Because it wouldn't be very helpful to have the sequence name appear as gi|1705478|sp|P52700|BLA1_XANMA in the multiple alignment display, we need to change the name to something more useful.

The choice of sequence names can make a lot of difference. Some of the programs we will use later will only recognize the first 10 characters of the sequence name; others will not accept certain characters, such as the "-" character, in the name. In particular, the name cannot include any spaces because then Clustal will only read the first part of the sequence name. The safest thing to do is to always pick names of 10 or fewer characters that use only letters and numbers.

Let's insert L1 immediately after the ">" character so that the first line now reads

>L1 gi|1705478|sp|P52700|BLA1_XANMA Metallo-beta-lactamase L1 precursor (Beta-lactamase, type II) (Penicillinase)

Continue down the list of sequences, inserting a recognizable name (followed by a space) after each ">" symbol.

In this example, the query sequence you used for the BLAST search was already in the databases. If your query were a new, unpublished sequence, that would not be the case and your FASTA file would not include the query.
In such a case, just enter the sequence manually by adding >sequenceName on a new line and pasting the sequence on the line directly below that.

Finally, we will save the file in plain text, or ASCII, format. This is important because ClustalX will not recognize Microsoft Word, WordPerfect, or other word processor files.

Getting the Data into ClustalX

Start ClustalX and you will see a window that looks something like Figure 1.13. Pull down the File menu and choose the Load Sequences menu item. Navigate to the folder (subdirectory) that contains the input file (in this case, Metallo.fasta) and choose that file. Clustal will load the data from that file and the window will now look like Figure 1.14. The left pane lists the sequences according to the name that follows the ">" symbol in the input file. The right pane shows the beginning of each sequence. You can scroll to the right to see the rest of each sequence by using the scroll bar at the bottom of that pane.

Chapter 1 Files: Metallo.fasta

You may note that many of the residues are shaded in Figure 1.14. On your screen, those shades of gray will be different colors. The colors are applied according to a scheme that indicates the group of amino acids to which the consensus (most common) residue at each position belongs. At this point, however, the sequences have not yet been aligned and the consensus colors are meaningless.

Some General Comments about Creating Alignments

An alignment is not an absolute thing. It is a "best guess" according to some algorithm used by a computer program. One cannot simply have a program compute an alignment and, without further thought, use that alignment to create a phylogeny. It is necessary for the user to carefully and thoughtfully examine each alignment to see whether it makes biological sense.
Often it will be useful to modify some of the parameters used by the computer program in order to improve the alignment. I will discuss such modifications in Chapter 2 in the section on "Fine-Tuning Alignments."

Setting the Alignment Parameters

ClustalX creates a multiple alignment in three stages:

1. It individually aligns each sequence to each of the other sequences in a series of pairwise alignments.
2. It uses that set of pairwise alignments to create a guide tree.
3. It uses that guide tree to help create the multiple alignment.

In order to create pairwise alignments, ClustalX needs to know what penalties to assign for the creation of a gap and for the "extension" (length) of that gap. Pull the Alignment menu down to choose the Alignment Parameters menu item, which will reveal a submenu from which you should choose Pairwise Alignment Parameters (Figure 1.15).

You will then see a dialog box that looks like Figure 1.16.

Figure 1.16. Pairwise Parameters dialog: Pairwise Alignments (Slow-Accurate); Gap Opening [0-100] 10.00; Gap Extension [0-100] 0.10; Protein Weight Matrix (BLOSUM 30, PAM 350, Gonnet 250, Identity matrix, User defined); DNA Weight Matrix (IUB, CLUSTALW(1.6), User defined).

Alignment and gap penalty parameters. The first choice, Pairwise Alignments, allows you to choose between a Slow-Accurate method and a Fast-Approximate method. The Slow-Accurate method is preferred, but if you are aligning so many sequences or the sequences are so long that the program takes a long time to run, you may want to use the Fast-Approximate method.
Most modern computers are so speedy that you probably will not need the Fast-Approximate method.

The box shows the default values for the Gap Opening penalty (10.00) and the Gap Extension penalty (0.10). Decreasing the gap penalties will allow the introduction of more gaps and will thus produce fewer mismatches in the alignment, but may also result in spurious matches that do not really reflect homology (identity by descent). Increasing the gap penalties will have the opposite effect: increasing the rigor of the alignment may result in missing matches that actually do reflect homology.

For aligning DNA sequences, I prefer the default parameters shown in Figure 1.16. For aligning protein sequences, I prefer increasing the gap opening penalty to 35 and the gap extension penalty to 0.75 as a starting point. The important thing to remember is that after the multiple alignment is complete, we will examine the alignment and see if changing the parameters will improve it.

Weight matrix parameters. As pointed out earlier, ClustalX seeks to maximize the score of the alignment by giving high scores to matching residues and low or zero scores to mismatching residues. The IUB DNA Weight Matrix scores matches as 1.9 and mismatches as 0, except that it scores all X's and N's as matches to any IUB ambiguity symbols.

The Protein Weight Matrix is more complicated because during alignment ClustalX takes into account not only identity, but also biochemical and coding similarity of residues when calculating the score of the alignment. The various protein weight matrices weight different mismatches slightly differently. Each gives the highest weight to identical residues (e.g., Tyr-Tyr), but some mismatches get higher scores than others based on the biochemical and functional similarities of the different amino acids (Tyr-Phe scores higher than Tyr-Pro, for instance).
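The two kinds of numbers discussed above, gap penalties and weight-matrix scores, can be illustrated with a few lines of code. This is a sketch under stated assumptions, not a description of ClustalX internals. The gap formula (opening penalty plus extension penalty times gap length) is one common convention, and ClustalX further adjusts its penalties by position. The three substitution scores are the values usually tabulated for the BLOSUM62 matrix, chosen only because they show the Tyr-Tyr, Tyr-Phe, Tyr-Pro ordering; they are not ClustalX's default Gonnet values. The function and variable names are hypothetical.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Penalty for a single gap: one opening cost plus a cost per gapped position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


# Excerpt of a protein weight matrix (BLOSUM62 values, for illustration only).
# Keys are residue pairs in alphabetical order, using one-letter codes.
SUBSTITUTION = {
    ("Y", "Y"): 7,   # Tyr-Tyr: identical residues score highest
    ("F", "Y"): 3,   # Tyr-Phe: biochemically similar, intermediate score
    ("P", "Y"): -3,  # Tyr-Pro: dissimilar, low score
}


def pair_score(x, y):
    """Look up the score for aligning residue x with residue y."""
    return SUBSTITUTION[tuple(sorted((x, y)))]


# A three-residue gap under the default and the suggested protein settings.
print(gap_penalty(3, gap_open=10.0, gap_extend=0.1))
print(gap_penalty(3, gap_open=35.0, gap_extend=0.75))

# The ordering of scores described in the text.
print(pair_score("Y", "Y"), pair_score("Y", "F"), pair_score("Y", "P"))
```

Under this convention a three-residue gap costs roughly 10.3 with the defaults and 37.25 with the higher protein settings, which gives a feel for why raising the penalties makes the program more reluctant to open gaps.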
The BLOSUM matrix appears to be the best for searching databases. The PAM matrix has been used widely for about 20 years, and the default Gonnet matrix amounts to an updated PAM matrix that is based on a far larger data set. For now, I suggest using the default Gonnet series, but you should feel free to choose alternative matrices and to realign the sequences to get a feel for the effect of the matrix on the alignment. If you do change the matrix, be sure to make the same change in the next set of settings, Multiple Alignment Parameters.

Multiple alignment parameters. Choose the Multiple Alignment Parameters from the Alignment Parameters menu to see a dialog box that looks like Figure 1.17.

Figure 1.17. Multiple Parameters dialog: Gap Opening [0-100] 10.00; Gap Extension [0-100] 0.20; Delay Divergent Sequences (%); DNA Transition Weight [0-1] 0.50; Use Negative Matrix (OFF); Protein Weight Matrix (BLOSUM series, PAM series, Gonnet series, Identity matrix, User defined); DNA Weight Matrix (IUB, CLUSTALW(1.6), User defined).

Again, for DNA sequences I like the default settings, but for protein sequences I prefer to change the Gap Opening Penalty to 15.00 and the Gap Extension Penalty to 0.3. Delay Divergent Sequences determines how different two sequences must be in order for their incorporation into the multiple alignment to be delayed. I prefer to set this to 25%, but you can use the default value of 30% if you like. If you chose an alternate Protein Weight Matrix for pairwise alignments, be sure to choose the same matrix for multiple alignments now.

Format. The last setting you need to apply before performing the alignment is the format for the output. When it creates an alignment, ClustalX writes that alignment to your hard drive in the form of an output file.
The format of that file is user-determined, and the user makes the decision based on the needs of the program that will use the alignment file to construct a phylogeny, or for any other purpose. Choosing Output Format Options under the Alignment menu will display the dialog shown in Figure 1.18.

Figure 1.18. Output Format Options dialog: Output Files (CLUSTAL format, NBRF/PIR format, GCG/MSF format, PHYLIP format, GDE format, NEXUS format); GDE output case; CLUSTALW sequence numbers; Output order; Parameter output.

In the Output Files section you can check any or all of the boxes (you must check at least one); the default is CLUSTAL format. ClustalX will create and write an output file for each of the boxes you checked. The files will be written to the same folder (directory) that contained the input file.

We will eventually be using PAUP* to create phylogenies from the ClustalX output, and for that purpose the Nexus format is most convenient. That output file will have the suffix .n

Shown in tabular form, the results are startling:

Taxa    Unrooted trees    Rooted trees    Comment
4       3                 15
8       10,395            135,135
10      2,027,025         34,459,425
22      3 x 10^23                         Almost a mole of trees
50      3 x 10^74                         More trees than the number of atoms in the universe
100     2 x 10^182

Using ClustalX to Create a Neighbor Joining Tree

If it is not already running, start ClustalX, choose Load Sequences from the File menu, and open the genomes.aln alignment file. Everything having to do with creating an NJ tree is implemented from the Trees menu (Figure 1.24).

Figure 1.24. Trees menu: Draw N-J Tree, Bootstrap N-J Tree, Exclude Positions with Gaps, Correct for Multiple Substitutions, Save Log File, Output Format Options.

Although ClustalX creates trees, it does not draw or display those trees on the screen. Instead, it saves trees in files that can be understood by other programs that do draw trees.
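As an aside on the tree counts tabulated above: they follow from a standard counting formula. For n taxa there are (2n-5)!! distinct unrooted bifurcating trees, where the double factorial is the product 3 x 5 x ... x (2n-5), and each unrooted tree has 2n-3 branches on which a root could be placed. The short sketch below reproduces the table entries; the function names are hypothetical and the code is offered only as a check on the arithmetic.

```python
def unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n taxa: (2n-5)!!."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed for an unrooted tree")
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):  # 3, 5, ..., 2n-5
        count *= odd
    return count


def rooted_trees(n_taxa):
    """Number of rooted bifurcating trees: a root on any of the 2n-3 branches."""
    return unrooted_trees(n_taxa) * (2 * n_taxa - 3)


for n in (4, 8, 10, 22, 50, 100):
    print(n, f"{unrooted_trees(n):.3g}", f"{rooted_trees(n):.3g}")
```

Python integers have arbitrary precision, so even the count for 100 taxa is computed exactly before being formatted for display.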
Just as there are several formats for alignments, there are several formats in which ClustalX can write trees. From the Trees menu choose Output Format Options (Figure 1.24) to display the dialog in Figure 1.25.

Figure 1.25. Output Tree Format Options dialog: Output Files (CLUSTAL format tree, Phylip format tree, Phylip distance matrix, Nexus format tree); Bootstrap labels on: NODE.

Both TreeView, which you will use during this tutorial, and PAUP*, which you will use later, use the Nexus format. Check the Nexus Format Tree box, and change Bootstrap labels on: to Node, then click the Close button to dismiss the dialog.

Choose Draw N-J Tree from the Trees menu (Figure 1.24). A dialog will appear showing that the files will be saved into the same location as the input file, the .dnd file, and the various alignment files. Click OK to save the tree files as genomes.tre and genomes.treb. Now choose Bootstrap N-J Tree from the Trees menu and again click the OK button on the resulting dialog. The bootstrap operation will take a few seconds. (For the moment, don't worry about what bootstrap is. We will get to it later.) You are done with ClustalX and you can quit the program by choosing Quit under the File menu.

Drawing the Tree Using TreeView

Start TreeView and open the genomes.tre file to see the TreeView tree window that displays the NJ tree of the sequences in the genomes.aln alignment (Figure 1.26).

In Figure 1.26, GOB5 and AnoGam are two external nodes. Branches from those two nodes join to create the internal node labeled I1. (I have labeled some of the nodes for discussion purposes.) FEZ1 is another external node that connects to node I1 at internal node I2. SSO1157 and MJ0296 are two other external nodes that join at the internal node labeled I3, which connects to external node BC1 at I4. I2 and I4 are two internal nodes that join at internal node I5.
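The branching order just described can be written down as a nested data structure, which makes it easy to list the external nodes that descend from any internal node. The sketch below is only an illustration of the description above: the tuple layout and function names are assumptions, this is not how TreeView stores trees, and the taxon names are given as best they can be read from the figure text, so treat them as placeholders.

```python
# Each internal node is a (label, left child, right child) tuple;
# each external node (a sequence) is a plain string.
TREE = (
    "I5",
    ("I2", ("I1", "GOB5", "AnoGam"), "FEZ1"),
    ("I4", ("I3", "SSO1157", "MJ0296"), "BC1"),
)


def leaves(node):
    """Return the external nodes at or below this node, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def descendants_by_node(node, found=None):
    """Map each internal node label to the external nodes descended from it."""
    if found is None:
        found = {}
    if not isinstance(node, str):
        label, left, right = node
        found[label] = leaves(node)
        descendants_by_node(left, found)
        descendants_by_node(right, found)
    return found


for label, members in sorted(descendants_by_node(TREE).items()):
    print(label, members)
```

Note that this structure records only the branching order; it carries no branch lengths.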
Tree Formats: Different Appearances of the Same Tree

At this point, it is worthwhile to point out the advantages of certain styles of representations of phylogenies over others. The terms "cladogram" and "phylogram" are used here as they are in PAUP* (to refer to styles of drawing trees) and are not used in their historical senses within the field of phylogenetics.

A cladogram shows only the branching order of nodes. Cladograms can be presented as either slanted (Figure 1.26) or rectangular. By clicking the square cladogram button (arrow, Figure 1.27) the more familiar rectangular cladogram (Figure 1.28) is displayed.

Figures 1.26 and 1.28 show exactly the same information. Notice that in each case the various internal nodes are lined up vertically above one another. In a cladogram, whether slanted or rectangular, the lengths of the branches convey no information whatsoever; only the branching order is displayed. A cladogram thus displays only the topology of a tree.

At this point it is useful to introduce another term, clade. All of the descendants of a common ancestor represented by a node belong to the same clade defined by that node; a clade is also called a monophyletic group. FEZ1, GOB5, and AnoGam all belong to the same clade stemming from node I2.

A phylogram displays both branching order and distance information. Click the Rectangular Phylogram button (arrow, Figure 1.29) to see the rectangular phylogram view of the same tree (Figure 1.30).

Notice that in Figure 1.30, GOB5 and AnoGam are still connected at internal node I1, and node I1 is still connected to internal node I2. Now, however, we see that the branches connecting GOB5 and AnoGam to I1 are shorter than the branches connecting SSO1157 and MJ0296 to I3.
This simply means that there were fewer sequence changes between the common ancestor I1 and GOB5 and AnoGam than there were between the common ancestor I3 and SSO1157 and MJ0296. The branches are drawn so that their lengths are proportional to the evolutionary distance along that branch.

Distance is the number of changes that have taken place along a branch, usually expressed as the number of substitutions per site. A scale near the bottom of Figure 1.30 relates the length of a branch to the distance.

The appearance of the tree can be changed from the Trees menu (Figure 1.31) as well as by using the various buttons.

Figure 1.31. Trees menu: Radial, Slanted cladogram, Rectangular cladogram, Phylogram, Show Internal Edge Labels, Internal Label Font..., Choose tree..., Order..., Define Outgroup..., Root With Outgroup, Print Trees...

Bootstrapping a Tree

Although we have a sense of the topology of the tree (the order in which the different sequences diverge), we do not have a sense of how reliable these groupings are. Often it is important to get a statistical estimate of the reliability of some groupings. Bootstrapping is a widely used method for this purpose. (For more details see "Learn More about Estimating the Reliability of Phylogenetic Trees," p. 52.)

Bootstrapping is a method in which one takes a random sample of the sites in an alignment (sampling with replacement, so that some sites are drawn more than once and others not at all) and creates trees based on those samples. That process is iterated multiple times (a typical number is 1000, although a minimum of 100 can be used, but 2000 replicates are required for 95% reproducibility) and the results are compiled to allow an estimate of the reliability of a particular grouping. Fortunately, we do not have to create a thousand trees ourselves; ClustalX did it for us in the last operation before we quit ClustalX.

The ClustalX dialog that you simply accepted when you saved the bootstrap tree looked like Figure 1.32.
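The resampling step at the heart of bootstrapping can be sketched in a few lines. This is a simplified illustration, not the code ClustalX uses: the function name, the toy alignment, and the seed value are assumptions. Each replicate is built by drawing alignment columns (sites) at random with replacement until the replicate is as long as the original; a tree would then be built from each replicate, which is not shown here.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by resampling alignment columns with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Draw as many column indices as there are sites; repeats are allowed.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same columns are taken from every sequence, so columns stay intact.
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


# A toy alignment of three made-up sequences, ten sites each.
alignment = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTTCGTAC",
    "seq3": "ACGAACGTTC",
}

rng = random.Random(42)  # the seed fixes the sequence of random draws
for replicate_number in range(3):
    replicate = bootstrap_alignment(alignment, rng)
    print(replicate_number, replicate)
```

Starting a second run from the same seed would reproduce exactly the same replicates, which is why repeating an analysis with an unchanged seed does not give independent trials.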
The Random number generator seed is simply a starting point for the bootstrap trials. The actual number is not important, except that it is a good idea to change this number if you repeat the bootstrap process for the same data set (or else you are not running independent trials). Neighbor Joining analysis is not resource-intensive, and I recommend that you always use at least 1000 trials for this kind of analysis.

Figure 1.32 (dialog fields: Random number generator seed [1-100]; Number of bootstrap trials [1-10000], set to 1000; SAVE NEXUS TREE AS)

The tree file created by the bootstrapping operation was named genome.treb. In TreeView, open the genome.treb file. The tree (Figure 1.33) looks identical to Figure 1.26 except that the nodes are numbered, the word Trichotomy appears at the node at the extreme left, and the button for showing internal edge labels (arrow) is active and darkened.

The numbers indicate the number of times, out of 1000 bootstrap replications, all the members of the clade descended from the indicated node were together.

Figure 1.33

[Box: Learn More about Estimating the Reliability of Phylogenetic Trees]

Bootstrap numbers are usually placed at nodes, but you will sometimes find them placed along branches. You can use the menu or the buttons to display the bootstrap tree in rectangular cladogram or rectangular phylogram form.

Placing the Root of a Tree

The root of a tree is a representation of the common ancestor of all of the taxa being considered. Note also the way the tree is presented in Figures 1.26 and 1.33. It would appear that the node labeled Trichotomy represents the common ancestor of all the sequences. This is not necessarily the case! The program
has merely chosen one of the sequences arbitrarily to represent the root of the tree. A common mistake would be to use this tree as presented for the final phylogeny. ClustalX creates unrooted NJ trees. The least arbitrary (and therefore most correct) means to present the tree is to use the unrooted phylogram method. To display this style, choose Radial from the Trees menu or click the Unrooted Tree button (arrow, Figure 1.34). The tree will look like Figure 1.35 (the internal nodes have been labeled as in Figures 1.26-1.30).

Figure 1.35

The unrooted tree is unfamiliar to most molecular biologists and does not, at first glance, look like a tree at all. The point of a radial tree is to avoid implying that we know where the root lies when in fact we do not.

Obviously, this set of taxa had some common ancestor; the problem is where we should place the node that represents that ancestor, the root. The sequence alignment alone does not provide sufficient information to make that determination, and it is clearly inappropriate to place such an important piece of information completely arbitrarily. The choice of a root is often made on the basis of other information, and that choice must be justified.

To root a tree simply means to choose a point on the tree as representing the earliest time in the evolutionary history of those sequences. This can be done either by midpoint rooting, or by selecting any one of the sequences as an outgroup (a designated outsider to the rest of the sequences). (For more details see “Learn More about Rooting Trees,” p. 56.) TreeView only provides the option of outgroup rooting. We could arbitrarily designate any of the 13 taxa as the outgroup, but obviously not all rootings are equally likely. Often, a judgment can be made on the basis of what the proteins do and where they come from.
Thus, in our example, I know from external data that the SSO1157 and MJ0296 sequences come from the Archaea, and that they are more distantly related to the remaining sequences than the remaining Eubacterial sequences are to each other. We will therefore designate SSO1157 and MJ0296 as the outgroup by choosing Define Outgroup from the Trees menu (Figure 1.31) to show the dialog in Figure 1.36.

Figure 1.36 (the Define Outgroup dialog, with Ingroup and Outgroup lists)

Learn More about Rooting Phylogenetic Trees

Unrooted trees tell us only about phylogenetic relationships; they tell us nothing about the directions of evolution, that is, the order of descent. Rooted trees tell us about the order of descent from the root toward the tips of the tree. While unrooted trees are always more “correct” in that they don’t imply knowledge that we do not have, they are considerably less informative. The problem is deciding where to place the root.

Midpoint rooting places the root at the middle of the longest path between the two most distantly related taxa (the heavy lines in Figure 1). Such a placement implies that the rate of evolution has been the same along all branches, something we know is often not the case.

Suppose that two groups descended from a common ancestor, but that one group evolved much faster than the other. Faster evolution means more sequence differences accumulated in one group than in the other over the same period of time. Thus the true tree might in fact look like Figure 2, but midpoint rooting would produce Figure 3. Unless we are sure that evolutionary rates have been constant across the taxa being considered, midpoint rooting is risky.

Figures: (1) Unrooted Tree; (2) True Tree; (3) Midpoint Rooted Tree

The alternative to midpoint rooting is rooting with an outgroup.
An outgroup is a taxon that is more distantly related to each of the ingroup taxa than any of the ingroup taxa are to each other. That definition makes it appear that it should be easy to identify the outgroup among any set of taxa. The problem is that if evolutionary rates are unequal, as in the midpoint example, that definition may break down. The usual solution is to find a taxon that is distantly related to all of the taxa being considered, then add that taxon to the tree and use it as an outgroup to root the tree.

Sometimes finding an outgroup sequence is no more difficult than doing a BLAST search; other times it is infuriatingly difficult. The problem is that a distantly related sequence may be so distantly related that it does not share a common ancestor with the ingroup sequences; i.e., it is not homologous.

Suppose we want to find a sequence to use as an outgroup for the taxa on the tree shown in Figure 1.35. We could turn to the BLAST output from which we picked the sequences to include in the alignment (Figure 1.5) and scroll down near the end of the list, where the E-values range from 0.001 up to 0.007. Those proteins should certainly be more distantly related to the set of proteins on the tree than members of the “tree set” are to each other. One of these potential outgroups is a putative Zn-dependent hydrolase with the GI number 19552860 and an E-score of 0.002 when using the LI sequence as a query. The question is: Is that protein really a homolog of the other proteins on the tree?

A BLAST search finds sequences and parts of sequences with some homology to the query sequence. It is also possible to do a BLAST alignment of two sequences to evaluate their homology. Go to the BLAST home page, http://www.ncbi.nlm.nih.gov/BLAST/. In the Special screen, choose Align two sequences (bl2seq). In the resulting form, change to BLASTP and enter the accession numbers of the two proteins to be compared.
If we compare the putative hydrolase with the protein designated AnoGam on the tree, we find that the hydrolase aligns over about 60% of the length of the AnoGam query sequence and has an E-score of 0.18; it thus appears that the putative Zn-dependent hydrolase is a legitimate outgroup for the set of proteins on the tree. Knowing that, we can download the sequence of this protein, add it to the alignment, and reconstruct the tree using the putative Zn-dependent hydrolase as the outgroup.

When we cannot identify a distantly related sequence that exhibits homology over most of the length of the sequence, we are forced to turn to other means of identifying an outgroup. Typically the “other means” is the classically accepted phylogeny of the organisms from which the sequences are obtained. In the example used in this Tutorial, sequences SSO1157 and MJ0296 are from Archaea and all the other sequences are from Eubacteria, so we used SSO1157 and MJ0296 as the outgroup.

Perhaps more typically, if all of our sequences are from mammals, we might look for the sequence of a homologous protein from, say, birds. Again, however, we need to be sure (by doing a pairwise BLAST) that the sequences have been sufficiently conserved that the avian sequence can be meaningfully aligned with the mammalian sequences.

An outgroup need not consist of a single sequence. For the sequences in Figure 2.63 on page 145, all of the SHV sequences can be assigned to the outgroup and all of the TEM sequences to the ingroup.

In summary, outgroup sequences used to root trees must be sufficiently conserved so as to be sure that they are actually homologous, and external phylogenetic information can legitimately be used to assign sequences to outgroups.

Double-click the names that you want to assign to the outgroup, or use the “>>” button to assign them. If you make a mistake, use the “<<” button to move a name back to the list of ingroup sequences.
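The midpoint rooting rule described in the box above (place the root at the middle of the longest path between the two most distantly related taxa) can be sketched as follows. This is illustrative Python, not something TreeView offers; the function names, the edge-list representation, and the branch lengths are hypothetical, and the sketch assumes an unrooted tree whose branch lengths are all positive.

```python
from collections import defaultdict


def distances_from(graph, start):
    """Path length from `start` to every node, plus each node's predecessor."""
    dist, parent, stack = {start: 0.0}, {start: None}, [start]
    while stack:
        node = stack.pop()
        for neighbor, length in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                parent[neighbor] = node
                stack.append(neighbor)
    return dist, parent


def midpoint(edges):
    """Locate the midpoint of the longest tip-to-tip path of an unrooted tree.

    `edges` is a list of (node, node, branch_length) tuples.
    Returns (near, far, offset): the root lies on the branch joining `near`
    and `far`, `offset` distance units away from `near`.
    """
    graph = defaultdict(list)
    for a, b, length in edges:
        graph[a].append((b, length))
        graph[b].append((a, length))
    tips = sorted(n for n in graph if len(graph[n]) == 1)
    # Two sweeps find the most distant pair of tips in a tree.
    dist, _ = distances_from(graph, tips[0])
    first = max(tips, key=dist.get)
    dist, parent = distances_from(graph, first)
    second = max(tips, key=dist.get)
    half = dist[second] / 2
    # Walk back from `second` toward `first` until the halfway point is reached.
    node = second
    while dist[parent[node]] > half:
        node = parent[node]
    near = parent[node]
    return near, node, half - dist[near]


# Hypothetical tree: taxon C sits on a much longer branch than A, B, or D.
edges = [("A", "n1", 1.0), ("B", "n1", 1.0), ("n1", "n2", 2.0),
         ("C", "n2", 6.0), ("D", "n2", 2.0)]
print(midpoint(edges))
```

In this toy tree the midpoint falls on the long branch leading to C, simply because that branch is long. If C's lineage had merely evolved faster than the others, that placement would be wrong, which is the risk the box describes.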
Now choose Root with Outgroup from the Trees menu (Figure 1.37), and click the Slanted Cladogram button to see the rooted tree (Figure 1.38).

Figure 1.37

Figure 1.38

Compare Figure 1.38 with Figure 1.26, the unrooted tree. The underlying data are unchanged, but rooting the tree now indicates directional evolution. Clicking the Rectangular Phylogram and the Show Internal Edge Label buttons displays the rooted NJ tree with branch lengths scaled to distances and the bootstrap values indicated at the nodes (Figure 1.39).

Figure 1.39

Printing and Saving the Tree

Having done all this work to get the tree into just the format you want, you will certainly want to save that formatted tree. You will probably also want to print it. You may have noticed that when you resize the tree window, the tree stretches to fill that window. To see how your tree will actually look on the printed page, choose Print Preview from the File menu or click the Preview button (Figure 1.40).

Figure 1.40

The resulting window displays the tree as it will be printed. The Print button at the top left of the screen (Figure 1.41) allows you to print the tree; the Copy button copies the image to the clipboard so that you can paste it into your favorite drawing program (Canvas, Adobe Illustrator, CorelDraw, etc.); and the Picture button saves the image in a format that can be opened by a drawing program.

Figure 1.41 (buttons: Copy, Picture, Print, Close)

You can also Print and Print Preview from the File menu and can Copy from the Edit menu. Alternatively, you can choose Save Graphic from the File menu. In either case, TreeView for Macintosh will save the drawing in PICT format, while TreeView for Windows will save it in Windows Metafile format. Drawing programs for those platforms will open files in those formats.
Summary

At this point you should be able to

* Use a BLAST search to identify a set of sequences that are homologous to a sequence of interest.
* Select from that set the sequences that will be used to create a phylogeny.
* Download those sequences in both FASTA and GenPept formats.
* Use ClustalX to create an alignment from the FASTA file of sequences.
* Save the results of that alignment in any of several formats.
* Use ClustalX to create a Neighbor-Joining tree from the alignment and save the resulting tree file.
* Use ClustalX to bootstrap that tree and save the resulting file.
* Use TreeView to draw the tree.
* Use TreeView to modify the appearance of the tree, to print the tree, and to save the tree in a format that drawing programs can use.

It would now be valuable to pick a sequence of interest to you and to go through the same steps to put that sequence onto a phylogenetic tree. After that, move on to Chapter 2.

Basic Elements in Creating and Presenting Trees

Selecting Homologs: What Sequences Can Be Put on a Single Tree?

Homology must be distinguished from similarity. Homology means that two taxa or sequences are descended from a common ancestor and implies that, in an alignment, identical residues at a site are identical by descent. Similarity merely reflects the proportion of sites that are identical. Two unrelated sequences can be aligned and some sites will be identical, but that identity is not the result of descent from a common ancestor. Obviously, no matter how similar they may be, it is meaningless to put two unrelated sequences onto the same tree because the purpose of the tree is to show a process of descent from a common ancestor.

Of course, in one sense, all sequences may be descended from a common ancestral sequence. However, as genes or proteins evolve, they diverge from each other to the point where two genes may share so little sequence in common that they resemble each other no more than any two randomly chosen sequences.
At that point their sequence homology has disappeared and those two sequences should never appear together on the same sequence-based tree.

Trees that include non-homologous sequences are published surprisingly often. It is not uncommon for a set of enzymes that share similar catalytic properties and mechanisms to be given a common designation and subsequently placed onto the same tree, regardless of actual sequence homology.

Suppose you have just cloned and sequenced a dibibliacmuctinase (DBM) gene from the Uncommon Vole. You are familiar with a variety of DBM genes from organisms ranging from flatworms to humans and you want to see where your DBM gene fits in. A BLAST search using your DBM sequence as a query turns up many DBM genes, but the list fails to include a number of well-known DBM genes. The obvious thing to do is to use the Entrez browser at NCBI (http://www.ncbi.nlm.nih.gov/Entrez/) to individually download those well-known sequences and add them to the FASTA file that BLAST created for you.

This is exactly the point at which you are likely to get into trouble. By including those well-known DBM sequences together with your DBM sequence and the homologs that BLAST identified, you will in all likelihood be including non-homologous sequences on the resulting tree. If all the sequences you added are related to each other, then the tree will probably show two distinct groups (your sequence plus its homologs, and the sequences you added) connected by a very long branch. The tree will be misleading because it will imply an ancestral relationship that is false. More than that, the presence of two unrelated groups of sequences in the alignment will probably reduce the quality of the alignments within the groups, thus distorting the trees of the separate groups. In particular, branch lengths are likely to be distorted.
As long as you are choosing sequences from a list generated by a BLAST search you are pretty safe, but what can you do if you need to add other sequences (especially unpublished sequences) to your list? How do you know if those sequences are homologous to the sequences already on your list?

There is no single, well-accepted criterion for determining whether two sequences are or are not homologous, but a reasonable criterion of non-homology is that a pairwise BLAST alignment of the two sequences fails to find a significant alignment of the two sequences. The first step is to construct a phylogeny using only the sequences that you identified using the BLAST search. Next, pick out representative sequences from each of the major clades on that tree. Finally, under the Pairwise Blast section of the BLAST homepage (Figure 1.1) click the Blast 2 Sequences link to bring up the dialog in Figure 2.1. If you are testing protein sequences, change the Program designation from blastn to blastp (arrow, Figure 2.1). Paste the sequence in question into the upper box, paste one of your representative sequences into the lower box, and click the Align button.

Obviously, if the search reports that No significant similarity was found (Figure 2.2), you will not want to include the sequence in question on your list. If you want to be more conservative you may decide to impose a more stringent criterion, for instance a specified maximum E value, for inclusion on the list.

Figure 2.3A shows an example of a false tree that was created when all of the known aminoglycoside-6’-acetyltransferases were included on the same tree; Figure 2.3B shows the correct trees for each of the major clades separately. Notice the very long branches connecting the three clades in Figure 2.3A. Such long branches may be a “red flag,” and if you see them on a tree you have created it is at least worth checking to make sure members of the different clades are in fact homologous to one another.
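The screening procedure described above (compare each added sequence with representatives of the major clades and drop those with no significant alignment) amounts to a simple filter once the pairwise results are in hand. The sketch below is illustrative Python only: the sequence names, the E values, and the threshold are hypothetical, and it assumes one particular reading of the rule, namely that a candidate is kept if it aligns significantly with at least one representative.

```python
def screen_candidates(pairwise_evalues, max_evalue):
    """Split candidates into those to keep and those to leave off the tree.

    `pairwise_evalues` maps a candidate name to the E values obtained when it
    was compared with each representative sequence; None stands for a
    comparison that reported no significant similarity.
    """
    keep, drop = [], []
    for name, evalues in pairwise_evalues.items():
        found = [e for e in evalues if e is not None]
        # Assumed rule: one sufficiently good alignment is enough to keep it.
        if found and min(found) <= max_evalue:
            keep.append(name)
        else:
            drop.append(name)
    return keep, drop


# Hypothetical results of pairwise comparisons with three representatives.
results = {
    "candidate1": [2e-40, 5e-12, 3e-8],
    "candidate2": [None, 0.5, None],
    "candidate3": [None, None, None],
}

keep, drop = screen_candidates(results, max_evalue=1e-3)
print("keep:", keep)
print("drop:", drop)
```

Lowering `max_evalue` corresponds to the more conservative, more stringent criterion mentioned above.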
A related issue arises when dealing with multidomain proteins. The bacterial PTS sugar transport proteins, which have four separate functional domains, provide a good example of such a problem. In some cases, the four domains are fused into a single protein; in others two or three domains are fused, with the remaining domains existing as separate proteins. When multiple domains are present in a single protein, they may be arranged in different orders in different proteins. Alignments of the complete proteins or the genes encoding them are meaningless. It is necessary to treat the different domains as though they are separate proteins or genes. Cut the domains apart at domain boundaries as best you can and create separate trees for each domain. Don’t worry too much about the precise boundary positions; a few bases one way or another are unlikely to have a major effect on either the alignment or the tree.

[Figure: BLAST 2 SEQUENCES form]

Fine-Tuning Alignments

To follow this discussion download the CelF sequence from the “Phylogenetic Trees Made Easy” website. Load CelF.aln into ClustalX.

In Chapter 1, I stressed that the quality of a tree can be no better than the quality of the sequence alignment that underlies that tree. ClustalX offers quite a few tools to help refine and improve alignments. The easiest of these tools to use is the histogram displayed below the alignment. The height of each bar indicates the similarity of the characters at that site.
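The idea behind a per-site similarity histogram can be illustrated with a deliberately simplified score. The Python below is a sketch, not the column score that ClustalX actually computes; the function name and the toy sequences are hypothetical. It assumes the simplest possible measure: the fraction of sequences that share the most common character at a site, with gaps counting against the score.

```python
from collections import Counter


def column_conservation(sequences):
    """Fraction of sequences sharing the most common residue at each site."""
    scores = []
    for column in zip(*sequences):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # a column made up entirely of gaps
            continue
        most_common = Counter(residues).most_common(1)[0][1]
        scores.append(most_common / len(column))
    return scores


# Hypothetical aligned sequences (not the CelF alignment).
aligned = ["MKV-LA", "MKIGLA", "MRV-LS"]
scores = column_conservation(aligned)

# Crude text version of the histogram: '#' high, '+' medium, '.' low.
print("".join("#" if s > 0.9 else "+" if s > 0.5 else "." for s in scores))
```

A run of low scores in such a profile marks a region that is either genuinely divergent or misaligned; telling the two apart is what the Quality menu is for.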
In the CelF alignment, the 120-170 residue region looks pretty good, whereas the histogram in the 230-280 region is pretty flat (Figure 2.4).

Figure 2.4

ClustalX provides an entire menu, Quality, to deal with determining the local quality of the alignment (Figure 2.5A). It also provides an excellent Help menu (Figure 2.5B) with tips on using various parts of the ClustalX program, including the Quality menu. When in doubt, turn to the Help menu.

Figure 2.5
(A) Quality menu: Calculate Low-Scoring Segments; Show Low-Scoring Segments; Show Exceptional Residues; Low-Scoring Segment Parameters; Column Score Parameters; Save Column Scores to File
(B) Help menu: General; Input & Output Files; Editing Alignments; Multiple Alignments; Profile Alignments; Secondary Structures; Trees; Colors; Alignment Quality; Command Line Parameters; References

Selecting Show Low-Scoring Segments highlights the residues that are causing the low scores. There will always be some highlighted residues as the result of divergence of the sequences during evolution, but strong clustering of highlighted residues suggests misalignment (Figure 2.6).

Figure 2.6

Review the discussion of gap penalties in Chapter 1 (pp. 24, 29-30, 34-37). When penalties are too high, similar residues will not align, resulting in poor quality. When they are too low, there will be too many gaps, also resulting in poor quality. One way to deal with the problem is to realign problematic residue ranges using different gap penalties while leaving the bulk of the alignment alone. ClustalX provides the means to do that.

Under the Alignment menu in the Alignment Parameters options, choose Reset All Gaps before Alignment (Figure 2.7).
Next, change the gap penalties using the Pairwise Alignment Parameters and Multiple Alignment Parameters menu choices. Start within a fairly well-conserved region at the left flank of the low-scoring region (the left anchor), then select the range of residues you want to manipulate by clicking and dragging in the alignment pane below the residues until you are within a fairly well-conserved region at the right flank of the low-scoring region (the right anchor). Finally, under the Alignment menu choose Realign Selected Residue Range.

Figure 2.7 (the Alignment menu: Do Complete Alignment; Produce Guide Tree Only; Do Alignment from Guide Tree; Realign Selected Sequences; Realign Selected Residue Range; Align Profile 2 to Profile 1; Align Profiles from Guide Trees; Align Sequences to Profile 1; Reset New Gaps before Alignment; Save Log File; Output Format Options; Pairwise Alignment Parameters; Multiple Alignment Parameters; Protein Gap Parameters; Secondary Structure Parameters)

As you vary the gap penalties, note the effects on both the low-scoring region and on the anchor regions. Gap penalties that disrupt the anchor regions are to be avoided; those that improve the low-scoring region while maintaining the conserved flanking regions are helping. There is no firm guide to modifying gap penalties, but if the low-scoring region seems to have few gaps and many mismatching residues, it makes sense to decrease gap penalties; if it seems to have a lot of gaps, try increasing the penalties.

It must be understood that all of this manipulation is an attempt to reflect real events in the histories of those regions. It may well be the case that those are simply regions that have diverged a lot, and no amount of valid manipulation is going to change that.

It is very difficult to hold an image of the alignment in mind for comparison against the changed alignment. For that reason, it is useful to print the original alignment before doing any manipulations.
Because each manipulation will result in overwriting existing alignment files, you should move the original output files to a new folder (directory) before doing any manipulations. Better than just printing the alignment file is printing the alignment as it is displayed in the alignment pane, complete with shaded residues and the quality histogram. See Appendix II for instructions on printing the alignment in that format.

Major Methods for Creating Trees

Which Method Should You Use?

You may already be aware that there are a variety of methods currently being used to construct trees from sequence data, and you may even be aware that the field of phylogenetics is quite contentious with respect to which method is best. If you ask an evolutionary colleague which method to use, you are likely to get an answer such as, “You must use Parsimony” (or Neighbor Joining or Maximum Likelihood, etc., depending on which colleague you ask). “Other methods are just shoddy or worse.” Much of the opinion amounts to religious conviction, and you need not worry about it. You could just stick with Neighbor Joining, but the other methods offer some advantages and some disadvantages when compared with Neighbor Joining. It is important to understand several methods, to make your choices based on the situation at hand, and not limit yourself simply because NJ was used in the Chapter 1 tutorial.

There are two primary approaches to tree construction: algorithmic and tree-searching. The algorithmic approach uses an algorithm to construct a tree from the data. The tree-searching method constructs many trees, then uses some criterion to decide which is the best tree or best set of trees (see “Learn More about Tree-Searching Methods,” p. 70).

The algorithmic approach has two advantages: It is fast, and it yields only a single tree from any given dataset.
The two algorithmic methods in current use are Neighbor Joining, with which you are already familiar, and UPGMA (which stands for Unweighted Pair-Group Method with Arithmetic Mean). NJ has almost completely replaced UPGMA in the current literature. Both NJ and UPGMA are distance methods.

All the other methods in current use are tree-searching methods. They generally are slower, and some will produce several equally good trees. At first it might seem that the algorithmic methods are the obvious choice because they are fast and they result in a single tree that you can publish and get on with other things. At one time the speed issue was important, especially when a dataset included many sequences. Today's fast, powerful desktop computers have greatly reduced the speed problem, and for most datasets the speed advantage of algorithmic methods is negligible.

Although it may appear advantageous to have only a single tree to think about, that comfort can be quite misleading because it gives the impression that the tree you see is the right tree. It is essential to understand that the “right tree” doesn’t exist. We are trying to deduce the order in which existing taxa (sequences) diverged from a hypothetical common ancestor and the amount of change along the branches between the diverging events. It is extremely unlikely that those deductions will be correct in every detail, so the tree we see will not be an accurate depiction of historical events. Even if we are only concerned with tree topology, we can never be assured that the topology of the tree accurately reflects the historical branching order.

The best we can hope for is a tree that pretty well reflects what happened in the past, while realizing that we don’t know what happened in the past so we can never be entirely sure how accurate the tree is.
Tree-searching methods may yield one tree or several, but all methods implicitly acknowledge that the trees produced are only a subset of the possible trees that are consistent with the data.

Distance versus Character-Based Methods

I have already mentioned that NJ and UPGMA are distance methods. Distance methods convert the aligned sequences into a distance matrix of pairwise differences (distances) between the sequences (see “Learn More about Distance Methods,” p. 74). The matrix is much like the tables of “% homology” that often appear when only a few sequences are being compared. Distance methods use that matrix as the data from which branching order and branch lengths are computed. Character-based methods, including Parsimony, Maximum Likelihood, and Bayesian methods, all use the multiple alignment directly by comparing characters within each column (each site) in the alignment.

Parsimony looks for the tree or trees with the minimum number of changes (see “Learn More about Parsimony,” p. 94). It is often the case that there are several trees, typically differing only slightly, that are consistent with the same number of events and that are therefore equally parsimonious.

Maximum Likelihood looks for the tree that, under some model of evolution, maximizes the likelihood of observing the data (see “Learn More about Maximum Likelihood,” p. 104). ML almost always recovers a single tree, but programs such as PAUP* can be instructed to save multiple trees. An advantage of the ML method is that the likelihood of the resulting tree is known. A disadvantage is that ML is considerably slower than either Parsimony or NJ, and it is not difficult to exceed the capacity of even the most up-to-date desktop computer.

Bayesian analysis is a recent variant of Maximum Likelihood. Instead of seeking the tree that maximizes the likelihood of observing the data, it seeks those trees with the greatest likelihoods given the data (see “Learn More about Bayesian Analysis,” p.
120). Instead of producing a single tree, Bayesian analysis produces a set of trees of roughly equal likelihoods. The results of a Bayesian analysis are easy to interpret because the frequency of a given clade in that set of trees is virtually identical to the probability of that clade, so no bootstrapping is necessary to assess the confidence in the structure of the tree.

Learn More about Tree-Searching Methods (continued)

An exhaustive search is carried out by finding each of the possible trees by a branch-addition algorithm. The first three taxa are connected to form the only possible three-taxon tree, one that contains three branches (tree A in Figure 1). The fourth taxon is added by adding a new branch to the middle of each of the existing branches to generate the three possible four-taxon trees (trees B1, B2, and B3). Adding the fifth taxon requires adding a new branch to the middle of each of the five branches in each of the four-taxon trees to generate 15 trees. This is accomplished by adding each of the five possible branches to tree B1 to construct trees C11-C15, then backing down to tree B2 and adding each of the five branches to make trees C21-C25, then backing down to tree B3 and again adding the five possible branches to make trees C31-C35. If there were six taxa, starting with tree C11 and going through tree C35, seven branches would be added to each tree to make all of the possible trees at the D level.

There is an alternative, the branch-and-bound algorithm, that also guarantees finding the best tree but does not require searching every tree. A random tree containing all taxa is generated and evaluated. Then, starting at A, the three-taxon tree in Figure 1, the search moves out toward the tips. It does not attempt to construct all possible trees at each level of the search; instead it constructs a single tree, say B1, and evaluates it. If the criterion is minimum evolution and the current tree has a better (lower) score than the random starting tree, the search moves on to the next level by adding another branch. If the current tree has a worse score than the random tree, then it and all other trees that can be derived from it by adding more branches will have worse scores. The branch-and-bound search can thus discard all of its descendants without evaluating them. When that occurs, the search backs up one level, adds a branch somewhere else, and again starts searching toward the tip. If the search gets all the way to the tip and finds a score that is better than that of the random tree, that score now becomes the score against which all other scores are judged. As in the exhaustive search, the entire tree is covered by eventually backing down to the root level and starting out along the path that begins with B2, and then along the path that begins with tree B3.

When the number of trees is large and evaluating each tree would be too slow to permit using the branch-and-bound algorithm, a heuristic strategy is used. A heuristic approach is essentially a hill-climbing algorithm in which an initial tree is selected, then rearrangements are sought that improve the tree. There are too many heuristic algorithms to describe them in detail, but one common approach (with many variants) is the stepwise addition method. It is similar to branch-and-bound in that it starts with a three-taxon tree, then adds branches to make each of the three possible four-taxon trees. The difference is that at this point each of the trees is evaluated and the one with the best score is selected to make the five possible five-taxon trees that can be derived from it. At each level, only the best of the trees at that level is used to add the next taxon.
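The growth in the number of trees that an exhaustive search must visit (one three-taxon tree, three four-taxon trees, 15 five-taxon trees) can be made concrete with a short calculation. The sketch below is illustrative Python, not part of any program discussed in this book, and the function name is hypothetical. It assumes strictly bifurcating unrooted trees, in which a tree with k taxa has 2k - 3 branches, so adding the next taxon multiplies the number of possible trees by 2k - 3.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for `n_taxa` taxa."""
    if n_taxa < 3:
        raise ValueError("an unrooted tree needs at least three taxa")
    trees = 1  # the single three-taxon tree
    for taxa in range(4, n_taxa + 1):
        # The new taxon can join any branch of a tree that already holds
        # (taxa - 1) taxa, and such a tree has 2 * (taxa - 1) - 3 branches.
        trees *= 2 * (taxa - 1) - 3
    return trees


for n in (3, 4, 5, 6, 10):
    print(n, count_unrooted_trees(n))
```

For 10 taxa the count already exceeds two million, which is why branch-and-bound and heuristic searches are used for all but the smallest datasets.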
It would be lovely if there were some objective way to select the "best" method for constructing evolutionary trees, but no such way exists. No method is ideal for all performance criteria. Some of the criteria that have been considered are efficiency, robustness, computational speed, and discriminating ability. Efficiency is a measure of how quickly the method converges on the correct tree as the amount of data (lengths of the sequences) increases; robustness is a measure of how well the method can tolerate deviations from its assumptions and still recover the correct tree; computational speed is obvious; and discriminating ability is how well the method guarantees recovering the correct tree. There are often tradeoffs among these criteria in that methods that increase one measure decrease another (Hillis et al. 1996).

One might well ask, if we don't know which tree is the true tree, how can we measure how well a method recovers that tree? Usually, with real data, we cannot. The exception is some experimental evolutionary systems in which all of the descendants of a single clonal organism are available and the true tree can be known. Attempts to measure the relative effectiveness of methods are usually based on simulations in which a computer generates descendants of some starting sequence according to some evolutionary model. In the end, a set of taxon sequences is generated, but all of the intermediate steps are known, so the "true" tree is known. Various methods are then compared to see which best recovers the true tree and under what conditions they do so. The problem is that the methods that work best are those that incorporate the same assumptions that were used to generate the tree, so it is very difficult to extrapolate from simulation studies of method effectiveness to estimate effectiveness with real data.
Choosing among the methods is often just a pragmatic matter: if your computer takes longer to calculate the tree than you are willing to take, then use a faster method. My own rule of thumb is that I am willing to use a method that will run overnight while I am home. Therefore, if it takes longer than about 14 hours, I will probably choose another method. If speed is not an issue, I prefer a Bayesian analysis for several reasons. First, I can easily evaluate the reliability of the tree without bootstrapping, which is often impractical with ML. Second, I am uncomfortable with seeing only the single tree that NJ and ML produce and having no idea how much it differs from other trees that might be as good. Third, other methods do not allow me to have branch lengths on consensus or bootstrap trees, whereas Bayesian analysis as implemented by MrBayes does that. I emphasize that these are my reasons for a preference, not general reasons. They are personal and should not be interpreted as recommendations.

Because time is often the basis for deciding which method to use, I have applied all four methods to the same datasets in Table 2.1. (Don't worry if you don't understand the table legend yet. You will after reading the section on that method.)

Table 2.1 Comparison of times required for the four major phylogenetic methods

    Number of    Neighbor     Parsimony    Maximum         Bayesian
    sequences    Joining                   Likelihood
    10           <0.01 sec    0.03 sec     …               35 min 52 sec
    20           <0.01 sec    0.03 sec     5 min 23 sec    … 32 min
    30           <0.01 sec    0.12 sec     45 min … sec    2 hr …

Figure 2.11

Oops! What do you do if you forgot to choose Nexus (or PHYLIP) as the output format before you aligned the sequences with ClustalX? You do not have to redo the entire alignment. Start ClustalX, pull down the Alignment menu, choose Output Format Options, and select the output format that you forgot. Load the .aln file into ClustalX, then use the mouse to select a column of characters in the alignment pane (preferably a column of identical characters) by clicking above that column. Pull down the Alignment menu and choose Realign Selected Residue Range. ClustalX will now write the alignment in the format you forgot to specify earlier.

Creating Neighbor-Joining Trees Using PAUP*

PAUP* for Windows/Unix: pages 179-183
PHYLIP: pages 188-190

Pull down the Analysis menu and choose Distance. Using the same Analysis menu, choose Neighbor Joining/UPGMA (Figure 2.12). On choosing Neighbor Joining/UPGMA, you will see the dialog box shown in Figure 2.13.

Figure 2.12

Be sure the Neighbor Joining button is selected and that the Randomly, initial seed button (arrow) is selected. The initial seed is a number that is used to seed the random number generator that breaks ties. The initial seed is usually based on the time since the computer was started. The actual number is not important, except that it is a good idea to change this number if you repeat the process for the same dataset; otherwise you are not running independent trials. Ordinarily you can accept the computer-generated number and click OK.

The PAUP* Main Display window will then show something like Figure 2.14. The Main Display window, incidentally, will keep a record of everything you do. You can choose to print this record in the end if you like. To do this, pull down the File menu and choose Print Display Buffer.
Figure 2.13

Saving the NJ Tree. It is always a good idea to save a tree as soon as it is created. Choose Save Trees to File... from the Trees menu (Figure 2.15). Remember, you will be saving the tree, not any particular appearance of the tree.

The resulting dialog (Figure 2.16) allows you to assign a name to the tree file. Before you name the file and click the Save button, you need to deal with some options and with the format in which the tree will be saved.

Figure 2.16

You will always want to save the branch lengths with the tree. Unfortunately, the default is to save tree files without branch lengths. To include branch lengths, click the Options... button (Figure 2.16) to bring up the dialog in Figure 2.17. Tick the Include branch lengths box and click the OK button to dismiss the Options dialog.
Figure 2.18

Next, pull down the Format menu seen in Figure 2.16 to display a list of formats (Figure 2.18) in which you can save the tree file. If you choose the default Nexus format, the file you save will be a text file that looks like this:

    #NEXUS
    Begin trees; [Treefile saved Tuesday, August 12, 2003 5:27 PM]
    [!
    >Data file = smallData.nj.tre
    >Neighbor-joining search settings:
    >  Ties (if encountered) will be broken randomly; initial seed = 624502657
    >  Distance measure = uncorrected ("p")
    >  (Tree is unrooted)
    ]
    Translate
        1 L1e,
        2 GOB1,
        3 THINB,
        4 FEZ1,
        5 mbl1,
        6 mbisi2,
        7 CAU1,
        8 L1c,
        9 L1d,
        10 L1a
        ;
    tree PAUP_1 = [&U] (1:0.1012,(((2:0.2714,4:0.2457):0.0950,(5:0.0024,7:0.0022):0.2485):0.0058,3:0.2572):0.1620,((6:0.0023,8:0.0084):0.0535,(9:0.0340,10:0.0396):0.0277):0.0318);
    End;

Notice that the tree itself does not include the taxon names directly. Instead it includes numbers and a translation table. Other programs that use the Nexus format may or may not be able to carry out the required translation. If instead you choose the Nexus (no translation table) format, the file will look like this:

    #NEXUS
    Begin trees; [Treefile saved Thursday, August 21, 2003 5:45 PM]
    [!
    >Data file = smallData...tre
    >Neighbor-joining search settings:
    >  Ties (if encountered) will be broken systematically
    >  Distance measure = uncorrected ("p")
    >  (Tree is unrooted)
    ]
    tree … ((mb1521:0.002257,L1c:0.008450):0.05… 033972,…:0.039592):0.027702):0.031829):0.161984 … 257175):0.005792,(mbl1:0.002373,CAU1:0.002243):0.245462):0.098027,FEZ1:0.245693):0.271364,GOB1:0);
    End;

Most programs that use Nexus files can use this format. Choose whichever format you prefer and click the Save button to save the tree file.

Please read the section on "Presenting and Printing Your Trees" (pp. 135-147) for important information on opening tree files within PAUP*.

Printing the Tree. You now have a Neighbor-Joining tree, which you can view or print by pulling down the Trees menu and choosing Print Trees (Figure 2.19). On choosing Print Trees, you will see the dialog box in Figure 2.20.

Figure 2.19
Figure 2.20

PAUP* Windows/Unix and PHYLIP users will have to use TreeView as described in Chapter 1 and in the Chapter 2 section on "Presenting and Printing Your Trees" to display and print trees.

You can now use the Plot type pulldown menu (arrow, Figure 2.20) to see the different available options to view the tree (Figure 2.21).

Figure 2.21

The choices are simply different ways to visually represent the same information and correspond to the different tree formats in TreeView discussed in Chapter 1. Cladograms show only branching order, and phylograms show branch lengths as well. For the moment, leave the Slanted Cladogram choice selected and click the Preview button to see a slanted cladogram tree (Figure 2.22).
Figure 2.22

The buttons at the left of the tree (Figure 2.23) allow you to Copy the tree to the clipboard so that you can paste it into a drawing program, to Save the tree as a PICT file that most Macintosh drawing programs can open, or to dismiss the tree (Done). If there is more than one tree in memory, the Next Page and Previous Page buttons allow you to scroll through those trees.

If you want to root the tree, click the Rooting button in the Print dialog (Figure 2.20) to see the rooting window (Figure 2.24).

PAUP* allows you either to root the tree at its midpoint or to use an outgroup as you did in Chapter 1 using TreeView. I suggest that you choose Outgroup rooting, Make ingroup monophyletic, and Make outgroup a monophyletic sister group to ingroup, as shown in Figure 2.24. Click the Define Outgroup button to bring up the dialog in Figure 2.25. The outgroup selection dialog in PAUP* works exactly as it does in TreeView (Chapter 1, p. 55). Just double-click the names of taxa in the Ingroup list that you want to add to the Outgroup list.

Figure 2.25

The PAUP* tree-drawing interface gives you more flexibility than does TreeView. Not only can you display trees with branches drawn proportionally to their lengths (phylogram formats), you can print the branch lengths next to the branches in any of the formats. To do so, tick the Show Branch Lengths box in the print dialog (Figure 2.26).
Figure 2.26

Doing this allows you to modify the fonts for both the taxon labels and the branch length labels. I like Helvetica Bold for the taxon labels, but I prefer Palatino for the branch length labels. You can also determine the width of the lines used to draw the tree. I prefer slightly heavier 1.5 point lines. Finally, if you tick the Include Title box (Figure 2.26) you can define a title that will be printed on the tree. To define that title, click the Set button.

It is always wise to click the Preview button to see that the tree looks the way you want it to. When the appearance is satisfactory (Figure 2.27), click the Print button to print the tree.

Figure 2.27

Figure 2.28 shows the NJ tree from the LargeData set in the phylogram format. Note that the scale for branch lengths is substitutions per site.

Figure 2.28

For more about displaying and printing trees, including using TreeView, see "Presenting and Printing Your Trees" later in Chapter 2 (pp. 135-147).

Bootstrapping the NJ Tree. It is always a good idea to estimate the confidence you should have in your tree. PAUP* makes it easy to obtain bootstrap estimates of that confidence. Refresh your memory about bootstrapping by reading Chapter 1 (pp. 50-53).

From the Analysis menu choose Bootstrap/Jackknife... (Figure 2.29). Be sure that the analysis method selected (Parsimony, Likelihood, or Distance) is the same as was used to create the tree.
Figure 2.29

In the resulting dialog (Figure 2.30) be sure Bootstrap is selected, enter the desired number of replicates in the box indicated by the arrow, and click Continue. In the resulting dialog, click Search.

Figure 2.30

To display and print the consensus tree, choose Print Bootstrap Consensus... from the Trees menu (Figure 2.31).

Figure 2.31

The bootstrap tree for the smallData is not very interesting because almost all clades have 100% confidence, but the bootstrap tree for the LargeData (Figure 2.32) shows confidences ranging from 56% (not very good) all the way to 100%.

Creating Parsimony Trees Using PAUP*

PAUP* for Windows/Unix: pages 179-184
PHYLIP: pages 187-190

You can use the same Nexus file that you used for the NJ tree to make a parsimony tree. Pull down the Analysis menu, be sure that Parsimony is checked, then choose Heuristic Search from the same menu. In the resulting dialog, just leave everything in its default state and click the Search button.
A status window (Figure 2.33) will show you how the search is progressing. When the search is complete, it will show a Close button and will indicate the number of trees that were created. The trees are now in memory. Save them to a file just as you did for the NJ tree above. You can preview and print the trees just as you did the NJ tree by selecting Print Trees from the Trees menu.

Figure 2.33

LEARN MORE ABOUT Parsimony

Parsimony is based on the assumption that the most likely tree is the one that requires the fewest changes to explain the data in the alignment. The basic premise of parsimony is that taxa sharing a common characteristic do so because they inherited that characteristic from a common ancestor. When conflicts with that assumption occur (and they often do), they are explained by reversal (a characteristic changed but then reverted back to its original state), convergence (unrelated taxa evolved the same characteristic independently), or parallelism (different taxa may have similar embryological mechanisms that predispose a characteristic to develop in a certain way). These explanations are gathered together under the term homoplasy. Homoplasies are regarded as "extra" steps or hypotheses that are required to explain the data. More formally, parsimony assumes that a character is more likely to be common to two taxa because it was inherited from a common ancestor than it is to be common because of homoplasy.
Parsimony operates by selecting the tree or trees that minimize the number of evolutionary steps, including homoplasies, required to explain the data. Parsimony, or minimum change, is the criterion for choosing the best tree. For protein or nucleotide sequences, the data are the aligned sequences. Each site in the alignment is a character, and each character can have different states in different taxa.

Not all characters are useful in constructing a parsimony tree. Invariant characters (those that have the same state in all taxa) are obviously useless and are ignored by the method. Also ignored are characters in which a state occurs in only one taxon.

An algorithm is used to determine the minimum number of steps necessary for any given tree (i.e., any given branching order) to be consistent with the data. That number is the score for the tree, and the tree or trees with the lowest scores are the most parsimonious trees. The algorithm is used to evaluate a possible tree at each informative site. Consider a set of six taxa, conveniently named 1-6. At some site (character) in the alignment, the states of that character are:

    1 A
    2 C
    3 A
    4 G
    5 G
    6 C

There are 105 possible unrooted trees of six taxa. We will pick the unrooted tree in Figure 1 as our example, but all will be evaluated by the computer.

Because two trees were saved, the Print Trees dialog will list two trees instead of one (Figure 2.34).

Figure 2.34

You can select individual trees to preview or print, or you can select all of the trees at once.
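To make the idea of a per-site parsimony score concrete, here is a minimal Python sketch that counts the minimum number of changes for the six-taxon site listed in the Learn More box above (A, C, A, G, G, C). It uses Fitch's counting method, a standard technique that I am assuming here; it is not necessarily how PAUP* does the calculation internally. The two tree shapes are hypothetical examples chosen for illustration, not the tree in the box's Figure 1, and each unrooted tree is rooted arbitrarily, which does not change the count.

```python
from typing import Union

# A tree is either a leaf name or a pair (left_subtree, right_subtree).
Tree = Union[str, tuple]


def fitch_steps(tree: Tree, states: dict) -> tuple:
    """Return (possible ancestral states, minimum changes) for one character."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    shared = left_set & right_set
    if shared:
        # The two lineages can agree on a state: no extra change is needed.
        return shared, left_steps + right_steps
    # No state in common: at least one change must have occurred here.
    return left_set | right_set, left_steps + right_steps + 1


# States of the example character in taxa 1-6, as given in the text.
site = {"1": "A", "2": "C", "3": "A", "4": "G", "5": "G", "6": "C"}

# Two hypothetical branching orders (assumptions for illustration only).
tree_x = (("1", "3"), (("2", "6"), ("4", "5")))
tree_y = (("1", "2"), (("3", "4"), ("5", "6")))

if __name__ == "__main__":
    print("tree_x:", fitch_steps(tree_x, site)[1], "steps")
    print("tree_y:", fitch_steps(tree_y, site)[1], "steps")
```

With three different states present, no tree can explain this site in fewer than two changes, so tree_x (2 steps) is as parsimonious as possible for this character, while tree_y (4 steps) requires two extra steps attributable to homoplasy. The score for a whole tree is the sum of such counts over all informative sites.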
Just as was the case for the NJ tree, you can display branch lengths, and you can root the tree with an outgroup. Figure 2.36 shows the trees rooted with GOB1 as the outgroup. To display both trees on the same page, click the Trees per Page button (Figure 2.34) to reveal the dialog in Figure 2.35 and click to select the two boxes that would position the trees above each other.

Figure 2.35

Please read the section on "Presenting and Printing Your Trees" (pp. 135-147) for important information on opening tree files within PAUP*.

Notice that the branch lengths in Figure 2.36 are not displayed as decimal fractions but as integers. For Parsimony, the default is to have the branches indicate the number of changes along that branch.

Figure 2.36

How can you choose between the two equally parsimonious trees in Figure 2.36? In one sense it doesn't matter; each of the trees is equally parsimonious and therefore as good as the other tree, so you can pick a tree at random. Another possibility is to compare the Parsimony trees with the NJ tree (Figure 2.27) and pick the Parsimony tree that most resembles the NJ tree. In either case you should indicate, either in the text or in a figure legend, that you are showing only one of n equally parsimonious trees.

Creating a Consensus Tree Using PAUP*

PAUP* for Windows/Unix: pages 179-185
PHYLIP: pages 187-190

Another option is to present a consensus tree. From the Trees menu select Compute Consensus. In the resulting dialog (Figure 2.37), select all of the trees you want to include in the consensus (usually all the trees). I like to use the 50% majority rule to compute the consensus, but you can use either strict or semi-strict rules if you prefer. In fact, PAUP* will calculate a consensus tree for each of the options that you check.
Figure 2.37

To view the consensus tree, select Print consensus tree(s) from the Trees menu. The print dialog will by now be familiar, and you can decide on the plot type as you did before. For the consensus tree, the plot type is always a cladogram and your only choice is the shape of that cladogram. This is because the branch lengths are not determined.

You can preview the consensus tree as you would any other. Figure 2.38 shows the consensus tree derived from the two trees in Figure 2.36. The numbers are not branch lengths; instead they show the percentage of trees in which the taxa above the indicated node are together. Notice that the consensus tree has a polytomy: three branches arising from a single node. The polytomy is more obvious when the consensus tree is displayed as a slanted cladogram, as it is in the tree shown in Figure 2.39. If you show a consensus tree you might want to point out that the polytomies represent uncertainty about the branching order.

If you wish to choose a single tree to present, in many cases you can choose the tree that most closely represents the consensus. In this case, where the two trees differ by a single node, neither "more closely represents" the consensus tree, so the choice is completely arbitrary.

Figure 2.38
Figure 2.39

Finally, your last option is to bootstrap the Parsimony analysis. (If you don't remember about bootstrapping, refer to pp. 50-53 in Chapter 1 and pp. 90-92 in this chapter.) The bootstrap tree, like the consensus tree, will not show branch lengths, but it will show the fraction of the time that the members of a particular clade (group of taxa) are together.
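The percentages on a majority-rule consensus tree can be understood as simple clade counts. The Python sketch below is an illustration under stated assumptions, not PAUP*'s implementation: trees are written as nested tuples and treated as rooted, the taxon names a-e are hypothetical, and the function names are made up for this example. The two toy trees differ at a single node, like the two parsimony trees discussed above.

```python
from collections import Counter


def clades(tree):
    """Return (set of leaves, list of clades) for a rooted nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the group of taxa descending from this node
    return leaves, found


def clade_frequencies(trees):
    """Percentage of the input trees in which each clade appears."""
    counts = Counter()
    for tree in trees:
        _, found = clades(tree)
        counts.update(set(found))
    return {clade: 100.0 * n / len(trees) for clade, n in counts.items()}


def majority_rule_clades(trees, threshold=50.0):
    """Clades found in more than `threshold` percent of the trees."""
    freqs = clade_frequencies(trees)
    return {clade for clade, pct in freqs.items() if pct > threshold}


# Two hypothetical trees that disagree only about how a, b, and c are joined.
tree_1 = ((("a", "b"), "c"), ("d", "e"))
tree_2 = (("a", ("b", "c")), ("d", "e"))

if __name__ == "__main__":
    freqs = clade_frequencies([tree_1, tree_2])
    for clade, pct in sorted(freqs.items(), key=lambda item: sorted(item[0])):
        print(sorted(clade), pct)
```

Here the groups (a, b) and (b, c) each occur in only 50% of the trees, so neither survives the majority rule, and the consensus shows a, b, and c arising from one node: a polytomy of the kind described above. The same counting, applied to trees built from bootstrap replicates, gives the bootstrap percentages.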
It should be understood that the existence of several equally parsimonious trees is not a flaw in the program, nor does it indicate a problem with the data. Multiple trees are often the result of very real polytomies in the tree. Like most of us, phylogeneticists prefer to keep things simple. The simplest situation is a strictly bifurcating tree: from every internal node there are exactly two branches (see "Learn More About Phylogenetic Trees," p. 42). Sadly, evolutionary history is not always so simple, and at times an ancestor may have given rise to multiple descendants within such a short span of time that the order of descent cannot be resolved. The result is multiple branches from an internal node: a polytomy. When the tree representing the history of a large set of sequences includes many polytomies, there may be hundreds of equally parsimonious trees.

Is the inconvenience of dealing with consensus trees a reason to simply accept the Neighbor Joining tree and get on with it? Not necessarily. Compare Figure 2.38 with Figure 2.27. Both are derived from the same data. Which is a more accurate representation of history? If the polytomy is real, there is a problem with the NJ tree in that in PAUP* distance trees are strictly bifurcating; no polytomies are allowed.

The LargeData produces 32 equally parsimonious trees, far too many to show here, but the consensus parsimony tree for the LargeData is shown in Figure 2.40. Because the consensus tree does not show branch lengths, I would probably publish one of the trees with branch lengths displayed, indicate in the legend that it is one of 32 equally parsimonious trees, and also publish the consensus tree or a bootstrap tree as a second part of the same figure.

Creating Maximum Likelihood DNA Trees Using PAUP*

PAUP* for Windows/Unix: Use the instructions in this section without modification.
PHYLIP: Although PHYLIP does create ML trees, it does not do so using the GTR model discussed below.
Using PHYLIP to create ML trees is sufficiently complex that it is beyond the scope of this book.

PAUP* cannot create Maximum Likelihood (ML) trees from protein sequences, but it does a very nice job with DNA sequences.

The number of possible trees depends on the number of sequences in the alignment, and it quickly becomes huge. It also depends on whether the tree is rooted or not. For unrooted trees it is

$$\frac{(2s-5)!}{2^{s-3}\,(s-3)!}$$

where $s$ is the number of sequences. The number of possible rooted trees is

$$\frac{(2s-3)!}{2^{s-2}\,(s-2)!}$$

Figure 2.40

Thus, for just 10 sequences there are $2.03 \times 10^{6}$ unrooted trees and $3.4 \times 10^{7}$ rooted trees. It is not possible to compare the likelihoods of all possible trees, so the program searches by comparing a tree in memory with a closely related tree and retaining the more likely, repeating that process until no improvement is obtained.

Visualize a surface consisting of each of the possible trees for a given number of sequences. The height of each point (tree) above that surface is the likelihood of the alignment data and the specified model of evolution given that tree. On that surface, the more closely related trees are to each other, the closer together they are. The surface thus consists of hills and valleys, with the most likely tree being the point that is at the top of the highest hill. The ML method starts at some point (some tree) and tries to find the top of the highest hill by moving from tree to tree, always accepting moves that go up and rejecting moves that go down in probability.

So far we have created trees by executing a data file (the Nexus alignment file that was produced by ClustalX), then using the mouse to select menu items, click buttons, etc.
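As a quick check on the tree counts quoted earlier in this section, the two factorial formulas can be evaluated directly. The short Python sketch below is only an illustration; the function names are my own, and exact integer arithmetic is used so that no rounding is involved.

```python
from math import factorial


def unrooted_trees(s: int) -> int:
    """Number of unrooted bifurcating trees: (2s-5)! / (2^(s-3) * (s-3)!)."""
    if s < 3:
        raise ValueError("an unrooted tree needs at least three sequences")
    return factorial(2 * s - 5) // (2 ** (s - 3) * factorial(s - 3))


def rooted_trees(s: int) -> int:
    """Number of rooted bifurcating trees: (2s-3)! / (2^(s-2) * (s-2)!)."""
    if s < 2:
        raise ValueError("a rooted tree needs at least two sequences")
    return factorial(2 * s - 3) // (2 ** (s - 2) * factorial(s - 2))


if __name__ == "__main__":
    for s in (4, 6, 10, 20):
        print(s, unrooted_trees(s), rooted_trees(s))
```

For 10 sequences this gives 2,027,025 unrooted and 34,459,425 rooted trees, matching the rounded figures in the text, and the numbers for 20 sequences show why an exhaustive comparison of likelihoods is out of the question.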
The number of instructions that need to be given to create an ML tree is large enough that it is actually easier to use an alternative way, the command-line interface, to tell PAUP* what to do with the data. Indeed, the command-line interface is the only option available to those who run PAUP* under Windows or Unix. While it is possible to issue individual commands by typing in the little one-line window at the bottom of PAUP*'s main screen, that is generally a bad idea. It is far too easy to mistype one word and not have anything work.

Chapter 2 Files: ML coding PAUP block and ML non-coding PAUP block

The better way to use the command-line interface is to put all of the commands together into a PAUP block that follows the data block in the input file. To make things easy I have included some example PAUP blocks as files on the web site. You can copy those blocks, then modify them slightly to create ML trees from your own data. Which block to use depends on whether your sequences are coding regions. If they are coding regions, you can make a better tree by considering the first, second, and third positions in each codon differently.

Duplicate your Nexus alignment file, rename the copy something like Myfile ML.nxs, and do everything to the copy! The alignment file looks like this:

    #NEXUS
    Begin data;
    Dimensions ntax=10 nchar=960;
    Format datatype=DNA gap=-;
    Matrix
    L1e      atgcgttctaccctgctcgccttcgccctctcgtcgctcgccctagccgcca...
    GOB1     ---------atgagaaattttgct...
    THINB    -atgacactattagcgaagttgatgctagcgacggttgcgaccat...
    FEZ1     --------atgaaaaaagtatta...
    mbl1     ----------atgaag...
    mb1511   ...
    CAU1     ------atgaag...
    L1c      atgcgttttaccctgctcgccttcgccct...
    L1d      atgcgttctaccctgctcgccttcgccctg...
    L1a      atgcgttctaccctgctcgccttcgccctg-gccgtcgc...
    ;
    end;

It consists of a single block, the data block, that immediately follows the word #NEXUS. It begins with Begin Data; and ends with End;. Similarly, the PAUP block will begin with Begin Paup; and end with end;.
In the Nexus format all command lines end with a semicolon.

The ML coding PAUP block looks like this:

    begin paup;
    set autoclose=yes warnreset=no increase=auto;
    charset first = 1-.\3;
    charset second = 2-.\3;
    charset third = 3-.\3;
    charpartition by_codon = 1:first,2:second,3:third;
    set criterion=parsimony;
    hsearch;
    set criterion=likelihood;
    Lset nst=6 rmatrix=estimate basefreq=estimate rates=sitespec siterates=partition:by_codon;
    Lscores 1;
    Lset rmatrix=prev basefreq=prev rates=sitespec siterates=prev;
    hsearch start=1;
    [this is a comment]
    savetrees brlens=yes maxDecimals=4 file=output.ml.trees;
    end;

The format for PAUP commands is to begin a line with a command such as set, followed by one or more option settings for that command. The command is terminated by a semicolon. A command does not have to be typed on a single line because it is the semicolon that terminates the command.

The first command in the above PAUP block is

    set autoclose=yes warnreset=no increase=auto;

A Command Reference documentation file, Cmd_ref_v2.pdf, is available from http://paup.csit.fsu.edu/downl.html; you should be sure to download it. That documentation is a command reference list that is not very user-friendly, but you can use it to look up each of the commands that PAUP will recognize.

Let's consider each line in the PAUP block to understand what it does. The set command on the first line of the block sets a variety of options. The option autoclose = yes sets the status window to close at the end of the search; warnreset = no turns off a user warning that a data block has already been processed; increase = auto automatically increases the maximum number of trees if that maximum is reached.

The next four commands are

    charset first = 1-.\3;
    charset second = 2-.\3;
    charset third = 3-.\3;
    charpartition by_codon = 1:first,2:second,3:third;
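The range notation in the charset commands is compact, so a small illustration may help. In a Nexus character set, a range such as 1-.\3 starts at site 1, runs to the last site (the period), and takes every third site (the \3). The Python sketch below simply lists the sites that each set would contain for the 960-site alignment shown earlier; the helper name strided_charset is hypothetical and is not a PAUP* command.

```python
def strided_charset(start: int, nchar: int, step: int = 3) -> list:
    # Sites picked out by a Nexus range of the form  start-.\step :
    # begin at `start`, run through the last site, take every `step`-th site.
    # Nexus sites are numbered from 1, so the result is 1-based.
    return list(range(start, nchar + 1, step))


nchar = 960  # from "Dimensions ... nchar=960" in the example data block

first = strided_charset(1, nchar)   # charset first  = 1-.\3;
second = strided_charset(2, nchar)  # charset second = 2-.\3;
third = strided_charset(3, nchar)   # charset third  = 3-.\3;

# charpartition by_codon = 1:first,2:second,3:third;
by_codon = {1: first, 2: second, 3: third}

if __name__ == "__main__":
    for position, sites in by_codon.items():
        print(position, len(sites), sites[:4], "...", sites[-1])
```

Each set holds 320 sites, and together the three sets cover every site exactly once, which is what allows a separate rate to be assigned to first, second, and third codon positions. This assumes, as the block itself does, that the alignment starts at the first position of a codon and stays in frame.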
