
a. How would you go about annotating the following two hypothetical sequences in the T. brucei genome? Explain your methods, results and conclusions.

*Note to self* T. brucei is a parasitic protist.

Sequences: GI:72387411 and GI:71744048

I will perform a blastp search to find out whether homologous annotated sequences exist in the NCBI BLAST database.

72387411: Hypothetical Protein annotation

Step 1 The 72387411 FASTA file was downloaded from the NCBI website: http://www.ncbi.nlm.nih.gov/protein/72387411
Step 2 Perform a blastp search on the nr database with default parameters.

Result:
I got multiple good hits, all with E-values smaller than 2e-71, and many of them with identity above 50%. Some of the hits are other hypothetical proteins; however, several are annotated as RIO kinase from yeast, Homo sapiens, Danio rerio, Nicotiana tabacum and Mus musculus.

Conclusion:
There is good evidence that this hypothetical protein is most likely related to RIO kinase, given the small E-values and the consistent matches to other RIO kinases.

71744048: Hypothetical Protein annotation

Step 1 71744048 was downloaded from the NCBI website: http://www.ncbi.nlm.nih.gov/protein/71744048
Step 2 Perform a blastp search on the nr database with default parameters.

Result:
I got two good hits, with E-values of 8e-122 and 1e-87 but only moderate identity (33%), to other hypothetical proteins within Trypanosoma. I also got a number of mediocre hits to TRAF2 and NCK interacting kinase, with a weak but acceptable E-value of 3e-04 and an identity of 33%. The results also indicate a possible binding site for PKC.

Conclusion:
There is some doubt, but a fair amount of evidence, that this hypothetical protein contains a PKC binding site. In the literature PKC is involved in the phosphorylation of TRAF2 (1), which also appears among the NCBI BLAST hits.

Reference:
(1) Li S, Wang L, Dorf ME. PKC phosphorylation of TRAF2 mediates IKKalpha/beta recruitment and K63-linked polyubiquitination. Mol Cell. 2009. 33(30-42).

b. Explain the difference between E-value and P-value

E-value: the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance.
P-value: the probability that at least one alignment with a score equivalent to or better than S occurs by chance.

The difference is that the E-value describes the expected number of chance events, while the P-value describes the likelihood of such an event happening. When the E-value is small the two are nearly the same, since P = 1 - e^(-E), which is approximately E for small E.

c. How many global alignments can you create between the following two sequences?

(2n)!/(n!)^2 = (22!)/(11!)^2 = 705432

Find optimal alignment
+1 for match, -1 for mismatches and -2 for gap
Sequence 1: CRELANCANTH
Sequence 2: PELICANSKK

CRELANCANTH
|||||
PELICANSKK
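The count and the scoring scheme in part c can be checked with a short sketch. It assumes the mismatch and gap scores are -1 and -2 (the minus signs are not visible in the text as extracted), uses a linear gap cost, and evaluates the answer's formula with n = 11. The function names `count_alignments` and `global_alignment_score` are illustrative only.

```python
from math import comb


def count_alignments(n: int) -> int:
    """Value of (2n)!/(n!)^2, the formula used in the answer above."""
    return comb(2 * n, n)


def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score (Needleman-Wunsch, linear gap cost).

    The default scores are an assumption about the intended scheme.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(count_alignments(11))
print(global_alignment_score("CRELANCANTH", "PELICANSKK"))
```

Only the score is computed here; recovering the alignment itself would require keeping the full matrix and tracing back from the last cell.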

Section 2: Computational gene finding (Xu)

a. Dicodon biases in regions with high G/C content

i) How to create frequency table

Assumptions:
1. We already know which human genes are protein-coding and which are non-protein-coding.
2. We already know which regions have high GC content and which have low GC content.
3. We already know what portion of the genome is coding genes and noncoding genes in the high GC content regions.
4. We already know what portion of the genome is coding genes and noncoding genes in the low GC content regions.
5. In reality we will perform the following algorithm for each individual chromosome, but to avoid confusion we will treat the entire sequence as one continuous sequence.

We will develop four different dicodon frequency tables:
1) built upon the human genome's dataset of genes coming from a high GC content region.
2) built upon the human genome's dataset of noncoding genes coming from a high GC content region.
3) built upon the human genome's dataset of genes coming from a low GC content region.
4) built upon the human genome's dataset of noncoding genes coming from a low GC content region.

Specifically, we will build each dicodon frequency table with the following algorithm.

Input: Sequences from the 4 different conditions mentioned above.
Output: Dicodon frequency table based on the inputted sequences.

Step 1. Generate a 20 by 20 amino acid table, where the rows represent the first codon and the columns represent the second codon. We will fill in the dicodon frequency table in the next few steps.

Step 2. Scan the entire dataset with a window size of 6 nt (representing the dicodon). We translate the first 3 nt into one amino acid and the last 3 nt into another amino acid, so that each 6 nt window becomes a dicodon.
Step 3. For each scanned dicodon, we add to its count in the dicodon frequency table, and we also keep track of the total count.
Step 4. Finally we divide each cell by the total count to get a frequency for each particular dicodon combination.
(As mentioned before, we will perform this for all 4 different types of sequence: coding & high GC, noncoding & high GC, coding & low GC, noncoding & low GC.)

ii) Which Model?

The preference model can be used in this case. To help our discussion we define S as the frequency of a dicodon in the gene (coding) region and B as its frequency in the noncoding region. The preference model uses the formula log(S/B). If S is greater or smaller than B, there is a preference. In our case a ratio S/B greater than 1 (log(S/B) greater than 0) means the dicodon prefers to be in the coding region, and a ratio smaller than 1 (log(S/B) smaller than 0) means the dicodon prefers to be in the noncoding region.

iii) How to execute the model in human genome

Step 1. We will take note of the high GC content and low GC content regions of the genome.
Step 2. Calculate all the ORFs of the DNA segment. We can ignore ORFs that are short; we can set a parameter for the user to decide this cutoff.
(Note) There are 6 different reading frames, 3 in each direction, and ORFs are regions between two stop sites.
Step 3. We will then scan the ORFs with a window size of 60 bp in increments of 10 bp. If we are in a high GC region then perform Step 4; if not, go to Step 5.
Step 4. (GC high region) We will calculate the preference score for each window using the dicodon frequency tables for the high GC region. We will then add up the scores and assign the sum to the position of the middle of the window.
Step 5. (GC low region) We will perform the same calculation as in Step 4 but using the low GC content dicodon frequency tables.
Step 6. The results from Step 4 and Step 5 are recorded to the same file and plotted: the x axis is the position and the y axis is the preference score.
Step 7. Now we will look at the plot and determine regions that have a strong preference for coding or noncoding.
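The table construction in part i) and the windowed scan in part iii) can be sketched as follows. This is a minimal illustration under stated assumptions, not a tested gene finder: dicodons are read in frame 0 of each input sequence, pairs containing a stop codon or an ambiguous base are skipped, and a pseudocount (not mentioned in the answer) keeps log(S/B) defined for unseen dicodons. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
AA = sorted(set(AMINO) - {"*"})  # the 20 amino acids


def dicodons(seq, start=0, end=None):
    """Yield in-frame amino acid pairs (frame 0 of seq) lying within [start, end)."""
    end = len(seq) if end is None else min(end, len(seq))
    first = start + (-start) % 3  # first in-frame position >= start
    for p in range(first, end - 5, 3):
        a = CODON_TABLE.get(seq[p:p + 3])
        b = CODON_TABLE.get(seq[p + 3:p + 6])
        if a and b and a != "*" and b != "*":
            yield a + b


def frequency_table(seqs, pseudocount=1.0):
    """20 x 20 dicodon frequency table (Steps 1 to 4); pseudocount is an assumption."""
    counts = Counter(pair for s in seqs for pair in dicodons(s.upper()))
    total = sum(counts.values()) + pseudocount * len(AA) ** 2
    return {x + y: (counts[x + y] + pseudocount) / total for x in AA for y in AA}


def preference_profile(orf, coding, noncoding, window=60, step=10):
    """Return (window midpoint, summed log(S/B)) for each window along an ORF."""
    orf = orf.upper()
    profile = []
    for start in range(0, max(len(orf) - window, 0) + 1, step):
        score = sum(log(coding[d] / noncoding[d])
                    for d in dicodons(orf, start, start + window))
        profile.append((start + window // 2, score))
    return profile
```

In use, one would build one coding and one noncoding table per GC class and pass the pair matching the GC content of the region being scanned; positive window scores then indicate a coding preference and negative scores a noncoding preference.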
(Note) The 60 bp window size and the 10 bp sliding increment are merely parameters, and they can be changed or adjusted to get the best result.

b. How to find exons using protein sequences

Exons are the sequences that code for protein; therefore, if we have a protein database we can search the nucleotide sequence for the regions that code for these proteins. To exclude a good number of false positives, we can be stringent in the parameters of the BLAST search.

We can perform TBLASTN, blasting each protein against each individual chromosome of the eukaryotic genome. For each BLAST hit, there must be an exact match in identity and a low E-value. To prevent false positives we will need to adjust the parameters so that the search is stringent in its selection.

We suggest modifying the parameters in the following way:
1. setting the E-value threshold to be low
2. max target sequences to be higher
3. gap penalty to be higher.

Reference:
[1] Chapter 13. NCBI BLAST Reference. Accessed on Oct 2nd 2009.
http://etutorials.org/Misc/blast/Part+V+BLAST+Reference/Chapter+13.+NCBI BLAST+Reference/13.3+blastall+Parameters/

c. Under what situation can we combine two bacterial genomes A and B, and how can we check it?

To solve this problem, we will pick a gene finder methodology that uses the dicodon frequency to do gene prediction in bacterial genomes. (We don't specify the methodology because we don't want to tailor our algorithm to a specific methodology.)

One of the basic criteria for combining bacterial genomes A and B is that they must have similar dicodon frequencies in the coding and noncoding regions, so that their results can be combined to make the prediction.

Step 1. Obtain sequences from the limited homologue hits. (We could also get the sequences from an EST database; however, we run the risk that these EST sequences might contain ncRNAs.)
Step 2. We will dedicate about 90% of the positive dataset above to training our model and 10% of it to testing.
Step 3. We can then build the dicodon frequency tables of the coding and noncoding regions within genomes A and B by using the coding homologue sequences.
Step 4. We can obtain the negative dataset, or background dicodon frequency, by randomly shuffling the coding region and using it to calculate the dicodon frequency table. To prevent bias in our result, we will take the same amount of data as in the positive dataset.
Step 5. We can then use the methodology and the dicodon frequencies to make gene predictions on the regions surrounding the test data sequences.
Step 6. We will calculate the sensitivity, specificity, and accuracy of the prediction based on the number of true positives, false positives, true negatives, and false negatives.
Step 7. We will repeat Steps 3 to 6, but this time building the dicodon frequency tables from each individual genome A and B and performing the tests on each individual genome. We can compare the results with those from the combined-genome training dataset to see whether the sensitivity, specificity, and accuracy are better or not.
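The bookkeeping in Steps 2, 6 and 7 can be sketched as below. The 90/10 split and the three metrics come from the steps above; the decision rule in `combined_is_no_worse` (the combined model must match or beat an individual model on all three metrics) is only one possible reading of "better or not" and is an assumption, as are all names and the counts in the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of true/false positives and negatives for one prediction run."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)


def split_train_test(items, train_fraction=0.9, seed=0):
    """Shuffle and split a dataset into training and test parts (Step 2)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def combined_is_no_worse(combined: Confusion, individual: Confusion) -> bool:
    """Hypothetical rule for Step 7: combined matches or beats every metric."""
    return (combined.sensitivity >= individual.sensitivity
            and combined.specificity >= individual.specificity
            and combined.accuracy >= individual.accuracy)


# Made-up counts, for illustration only
combined = Confusion(tp=9, fp=1, tn=9, fn=1)
genome_a_only = Confusion(tp=8, fp=2, tn=8, fn=2)
print(combined_is_no_worse(combined, genome_a_only))
```

Running the comparison against both the genome A and genome B models gives a simple yes/no answer to whether combining the two genomes is justified on the test data.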
*Note* We don't necessarily need to use the dicodon frequency. If another distinguishing feature of the bacterial genomes is available, we can use the same algorithm above to test whether or not we can combine genome A and genome B.

Section 3: Prediction of transcription regulatory binding sites (Olman)

a.

ACGTAC
CGTACA
GTACAC
TACGCA

The base frequencies within each sequence are the following (the same for all four sequences):
f(A) = 2/6
f(C) = 2/6
f(G) = 1/6
f(T) = 1/6

Entropy formula used, applied to the base frequencies at each position:

$H = -\dfrac{f(A)\log f(A) + f(C)\log f(C) + f(G)\log f(G) + f(T)\log f(T)}{\log 4}$

Position 1 the entropy is 1
Position 2 the entropy is 1
Position 3 the entropy is 1
Position 4 the entropy is 1
Position 5 the entropy is 0.5
Position 6 the entropy is 0.5

For example, position 5 contains A, C, A, C, so f(A) = f(C) = 1/2 and $H = -\dfrac{2 \cdot \tfrac{1}{2}\log\tfrac{1}{2}}{\log 4} = 0.5$.

The sum of all the entropy is 5.
The higher the entropy, the lower the information content.

b. If the Hamming distance between sequences is <= 1 then the information content is rather high, as the minimal total sum of entropy is 0. The reason is that if there aren't any mismatches (Hamming distance is 0), there is a really strong preference in those positions. For 4 sequences of length 6 in which none of the 6 positions has any variation, the entropy at each position would be 0; therefore, the minimal total sum of entropy would be 0 (that is, maximal information content).

Consider the following sequences:
ACGTAC
ACGTAC
ACGTAC
ACGTAC

c. The following are all 16 possible combinations for the three-nucleotide sequence to occur and their respective probabilities:

A->T 0.4    T->G 0.35   (0.4*0.35) = 0.14
A->T 0.4    T->A 0.2    (0.4*0.2) = 0.08
A->T 0.4    T->C 0.2    (0.4*0.2) = 0.08
A->T 0.4    T->T 0.25   (0.4*0.25) = 0.1
A->C 0.3    C->T 0.5    (0.3*0.5) = 0.15
A->C 0.3    C->A 0.1    (0.3*0.1) = 0.03
A->C 0.3    C->C 0.1    (0.3*0.1) = 0.03
A->C 0.3    C->G 0.3    (0.3*0.3) = 0.09
A->G 0.17   G->G 0.8    (0.17*0.8) = 0.136
A->G 0.17   G->A 0.06   (0.17*0.06) = 0.0102
A->G 0.17   G->C 0.3    (0.17*0.3) = 0.051
A->G 0.17   G->T 0.1    (0.17*0.1) = 0.017
A->A 0.13   A->T 0.4    (0.13*0.4) = 0.052
A->A 0.13   A->A 0.13   (0.13*0.13) = 0.0169
A->A 0.13   A->C 0.3    (0.13*0.3) = 0.039
A->A 0.13   A->G 0.17   (0.13*0.17) = 0.0221

The maximum likelihood sequence according to the Markov model would be ACT, with probability 0.15.

d.
Position 1
Freq of A = 5/8
Freq of C = 2/8
Freq of T = 1/8
Freq of G = 0/8
$H_1 = -\dfrac{\tfrac{5}{8}\log\tfrac{5}{8} + \tfrac{2}{8}\log\tfrac{2}{8} + \tfrac{1}{8}\log\tfrac{1}{8} + 0}{\log 4} = 0.65$ (the term $0 \cdot \log 0$ is taken as 0)

Position 3
Freq of T = 5/8
Freq of C = 1/8
Freq of A = 1/8
Freq of G = 1/8
$H_3 = -\dfrac{\tfrac{5}{8}\log\tfrac{5}{8} + \tfrac{1}{8}\log\tfrac{1}{8} + \tfrac{1}{8}\log\tfrac{1}{8} + \tfrac{1}{8}\log\tfrac{1}{8}}{\log 4} = 0.77$

The entropy for Position 1 is lower than for Position 3, meaning Position 1 has a higher information content and is more conserved.
The main difference is that in position 3 G occurs once (frequency 1/8), while in position 1 the frequency of G is 0, which makes position 1 more deterministic and less random than position 3.

Section 4: Computational protein structure prediction (Xu)

a. Create analysis procedure to validate whether Proline in general should appear in alpha helices.

The basic idea is to find the frequency of proline within alpha helices and compare it to the expected frequency if the amino acid distribution within alpha helices were even, that is, if each amino acid had a 5% chance of occurring inside an alpha helix. Therefore, if proline's frequency inside alpha helices is lower than 5%, there is an indication that statistically proline does not like to be inside an alpha helix. To explain the process in detail we present the following protocol:

1. Obtain all the amino acids that appear in alpha helices from the PDB database.

2. Calculate the frequency of occurrence of each individual amino acid.
3. In the process the occurrence of proline will be calculated.
4. If the frequency of proline is significantly less than 0.05 then we have a reasonable amount of evidence that proline does not appear in alpha helices.

(Note) To make the comparison easier we can compute log(true frequency of proline in alpha helices / hypothetical background frequency of proline in alpha helices). If the value is greater than or equal to 0 (a ratio of at least 1) we reject the conjecture that proline is disfavored in alpha helices, and if it is less than 0 (a ratio below 1) we accept the conjecture.

b. Features A, B, and C selection of transmembrane region in a protein sequence. Design a computational procedure (at a conceptual level) that can check if a suggested feature is useful.

Outline: We are performing feature selection through an exhaustive approach.

Assumptions and Definitions:
Define success as the feature identifying the sequence as a transmembrane region.
Define failure as the feature identifying the sequence as a non-transmembrane region.

*Note* A feature could also be a combination of features, but the following algorithm does not test the combined effects of various features.

1. Obtain a good-sized random sample of protein sequences that contain transmembrane regions and non-transmembrane regions.
2. For the feature selected for testing, we will apply a window and slide it one amino acid at a time, and the feature will assign success or failure to the scanned sequence based on the feature criteria.
(Note) Window size can change based on the average transmembrane region size, the feature's requirements, or a user-defined value.
3. Since we are evaluating each feature's general performance based on its assignment of each region as success or failure, we can label each region as a true positive, false positive, true negative, or false negative. This way we can calculate the feature's sensitivity, specificity, and accuracy.
4. Now we have a number, such as accuracy, with which to compare different features.

*Other Note*
The question did not seem to indicate that we need to perform a literature search to find possible features, but some are listed below in case it was one of the requirements for grading.
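Steps 1 to 4 of part b can be sketched for a single feature. The example below uses mean Kyte-Doolittle hydropathy over a sliding window as the feature; the 19-residue window and the 1.6 threshold are illustrative starting values and are assumptions here, as are the function names and the toy sequence with its hand-made labels.

```python
# Kyte-Doolittle hydropathy index per amino acid
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_calls(seq, window=19, threshold=1.6):
    """Call each residue 'success' (True) if the window centred on it
    has mean hydropathy >= threshold; windows are truncated at the ends."""
    half = window // 2
    calls = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        mean = sum(KD[r] for r in seq[lo:hi]) / (hi - lo)
        calls.append(mean >= threshold)
    return calls


def confusion(calls, truth):
    """Return (TP, FP, TN, FN) comparing feature calls with known labels."""
    tp = sum(c and t for c, t in zip(calls, truth))
    fp = sum(c and not t for c, t in zip(calls, truth))
    tn = sum(not c and not t for c, t in zip(calls, truth))
    fn = sum(not c and t for c, t in zip(calls, truth))
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


# Toy example: a hydrophobic stretch flanked by charged residues
seq = "D" * 10 + "L" * 21 + "D" * 10
truth = [False] * 10 + [True] * 21 + [False] * 10
counts = confusion(window_calls(seq), truth)
print(counts, accuracy(*counts))
```

Any other candidate feature can be plugged in by replacing `window_calls` with a function that returns one success or failure call per residue, which keeps the comparison between features on the same footing.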
Features
1. Hydrophobicity analysis: transmembrane regions are generally hydrophobic and non-transmembrane regions are hydrophilic.
2. Charge distribution on inside and outside loops.
3. Amino acid distributions in various structural parts.

References
[1] Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105-132.
[2] Boyd, D., Manoil, C. & Beckwith, J. (1987). Determinants of membrane protein topology. Proc. Natl Acad. Sci. USA, 84, 8525-8529.
[3] Tusnády, G. E. & Simon, I. (1998). Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J. Mol. Biol. 283, 489-506.

c. Find feature for determining disordered regions

The following list of features was obtained from a literature search:
1. Certain amino acids occur at a higher frequency in disordered regions than in ordered regions.
2. Aromatic amino acids are not as likely to appear in long disordered regions.
3. Cysteine and histidine are less common in disordered regions, while glutamine, asparagine, and lysine are more likely to occur in disordered regions.
4. Serine can increase solubility and provide a flexible locus to the disordered region.
5. Low overall hydrophobicity.
6. Large net charge.
7. Capability of a residue to form pairwise contacts, or the calculated packing density for each amino acid residue.

With these features we can take advantage of the ability of various machine learning algorithms to separate disordered regions from ordered regions.

Similar to the cat and dog categorization analogy in class, we are trying to separate disordered regions from ordered regions, but using features obtained from characteristics observed and accumulated over the years. Through this process we will also be able to figure out what causes a region of protein sequence to be disordered and what causes a region to be ordered.

References:
(1) Garner, E. et al. (1998) Predicting disordered regions from amino acid sequence: common themes despite differing structural characterization. Genome Inform. Ser. Workshop Genome Inform., 9, 201-213.
(2) Kissinger, C. R. et al. (1995) Crystal structures of human calcineurin and the human FKBP12-FK506-calcineurin complex. Nature, 378, 641-644.
(3) Burley, S. K. and Petsko, G. A. (1985) Aromatic-aromatic interaction: a mechanism of protein structure stabilization. Science, 229, 23-28.
(4) Uversky, V. N. et al. (2000) Why are natively unfolded proteins unstructured under physiologic conditions? Proteins, 41, 415-427.
(5) Galzitskaya OV, Garbuzynskiy SO, Lobanov MY. (2006) FoldUnfold: web server for the prediction of disordered regions in protein chain. Bioinformatics. 22, 2948-9.
