Manuel Castanon/ s3533799

City Campus

BIOINFORMATICS ONLINE TEST #1 2015


This test is to be completed during class time (2 hours) under exam conditions. You may
refer to any notes, books, etc. that you need. This assignment will be marked out of 24 and
is worth 15% of the final assessment for this course.
The easiest way to complete the assignment is to insert your answers in this document. Make
sure you save it in your student drive (not on the desktop!). Also, put your name and
student number as a header in the Word document. (To do this, click View on the menu
bar, then click Header and Footer, and insert your name and number in the box that
appears.) Submitted assignments with no name inserted will lose one mark.
Submit your results to the Test 1 box on the learning hub.
Remember, the assignment is to be in your own words: no copying and pasting from
websites (or your neighbour!). The exception, of course, is the results from tools that you
have used. Lastly, ensure that you manage your time so that you can complete all of the
assignment.
Good luck, Peter

Part A
WWW Data Extraction Exercise. Use the NCBI website (8 marks)
Question 1. (4 marks)
Type 4 Hemochromatosis
Using appropriate databases and analysis tools, find out all you can about this inherited
disease. You may use dot points where necessary to illustrate key points.
It is a disorder that causes the body to absorb too much iron from the diet. The excess iron is
stored particularly in the liver, heart, skin, pancreas and joints. It is also called an iron
overload disorder: because humans cannot increase the excretion of iron, the excess
accumulates and can damage tissues and organs.
Early symptoms of hemochromatosis can include fatigue, joint pain, abdominal pain and loss
of sex drive. Later symptoms can include liver disease, diabetes, heart abnormalities and skin
discolouration.
Hemochromatosis type 4 is also called ferroportin disease and is an adult-onset disease.
Men can develop symptoms between the ages of 40 and 60 years. Women generally develop
them after menopause.
Ferroportin-1 is a transmembrane protein that transports iron. It consists of 571 amino acid
residues. Mutations in this protein can reduce its iron-transport activity. It is encoded by the
SLC40A1 gene, which has been mapped to human chromosome 2q32.
Type 4 hemochromatosis is considered a rare genetic disorder; it has been studied in only a
small number of families worldwide. It is distinguished from the other types by its autosomal
dominant pattern of inheritance. This means that one copy of the altered gene in each cell is
enough to cause the disorder.
Question 2. (4 marks)
1. How many publications by Terence SPITHILL are indexed in PubMed (surname is in capitals)?
Make sure you format your query correctly!
111

2. How many reviews has he published? (Hint, use filter function on the left). What was the
date of the latest review, and how many papers have each of his co-authors on that paper
published? Include all papers they have published, not just reviews.
He has published 5 reviews. The latest review is dated August 2010.
Its co-authors have published the following numbers of papers: Piedrafita D, 48;
Smith RE, 1047; Raadsma HW, 66.
3. How many of his publications (not just reviews) had David PIEDRAFITA as a co-author
(either just PIEDRAFITA or with others)?
2
4. What is the general field that Piedrafita and Spithill study together?
Immunology
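The counts in Question 2 depend on how the PubMed query is formatted. The sketch below shows one way such queries could be written with field tags and turned into NCBI E-utilities count requests. It only builds the URLs and does not contact NCBI. The author strings ("Spithill T", "Piedrafita D") and the helper name pubmed_count_url are assumptions made for illustration, and the counts returned today will differ from those recorded above.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count_url(term):
    """Build an ESearch URL that asks PubMed only for the number of hits."""
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)

# Hypothetical query strings; the author initials are assumed, not taken from the test.
all_papers = "Spithill T[Author]"
reviews_only = "Spithill T[Author] AND Review[Publication Type]"
with_coauthor = "Spithill T[Author] AND Piedrafita D[Author]"

for query in (all_papers, reviews_only, with_coauthor):
    print(pubmed_count_url(query))
```

The [Author] tag restricts the match to the author field, so a surname that is also a common word is not matched in titles or abstracts.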

Part B
Working with sequences (10 marks)
Question 3. (5 marks)
1. Analyse the nucleotide sequence in the anonymous.txt file (pasted below). Does the DNA
sequence potentially encode a protein? If so, how many amino acids in the longest
reading frame? Yes, it could potentially encode a protein. The longest reading frame is 487
aa long.
2. Are there 5' and/or 3' non-coding regions? Be specific. There are non-coding regions at
both ends: the CDS (35..1486) does not span the whole sequence.
3. What protein does the sequence encode? From what species? AUX1-like amino acid
permease [Arabidopsis thaliana]
4. Which species (other than the source) has a protein that has the highest identity to this
one?
Arabidopsis lyrata.
5. Does the protein have a human homologue? Explain your answer.
No. A BLAST search restricted to human records with the Entrez query txid9606[ORGN]
returned no significant hits, so the protein does not appear to have a human homologue.
>anonymous
ATGGAGAACGGTGAGAAAGCAGCTGAGACTGTCGTTGTTGGGAACTATGTTGAGATGGAGAAGGATGGTA
AAGCTTTAGACATCAAATCTAAGCTATCTGACATGTTTTGGCATGGTGGCTCTGCTTATGATGCTTGGTT
CAGCTGTGCTTCCAATCAGGTGGCACAAGTGCTGTTGACACTGCCATACTCGTTCTCACAGCTGGGGATG
CTCTCAGGGATTCTGTTTCAGCTCTTTTATGGCATCTTAGGTAGTTGGACTGCTTACCTCATCAGTATTC
TCTATGTTGAATACAGAACCAGAAAGGAAAGAGAGAAAGTTAACTTCAGAAACCATGTCATTCAGTGGTT
TGAGGTTCTTGATGGATTGCTTGGGAAGCATTGGAGGAATGTTGGTTTAGCCTTTAACTGCACCTTCCTT
CTCTTTGGATCTGTCATTCAACTCATAGCTTGTGCCAGCAACATATATTACATAAATGATAATCTGGACA
AGAGGACATGGACATACATATTTGGAGCATGTTGTGCAACCACAGTCTTTATTCCTTCCTTCCACAACTA
CAGGATCTGGTCTTTTCTTGGACTCTTGATGACCACTTACACTGCTTGGTACCTCACCATTGCCTCTATC
CTCCATGGACAGGTAGAAGGAGTGAAGCATTCAGGACCAAGCAAGCTGGTTTTATACTTCACAGGGGCCA
CAAACATTCTTTACACATTCGGTGGACATGCTGTTACTGTAGAGATAATGCATGCTATGTGGAAACCTCA
GAAGTTCAAGTCCATATACCTGTTTGCAACACTCTACGTGCTGACGCTAACGCTGCCTTCTGCGTCTGCG
GTTTATTGGGCGTTTGGTGATTTGCTTCTAAACCATTCAAACGCATTTGCTCTTCTCCCAAAGAATCTTT
ACCGTGACTTTGCAGTTGTGCTTATGCTCATCCATCAGTTCATCACCTTTGGTTTTGCTTGCACGCCACT
CTACTTTGTGTGGGAGAAGCTAATAGGGATGCATGAGTGCAGAAGCATGTGTAAACGAGCCGCTGCGAGG
CTCCCTGTCGTTATACCCATTTGGTTTCTTGCTATCATATTCCCGTTCTTTGGTCCCATTAACTCAACCG
TAGGATCTCTTCTCGTCAGCTTCACTGTCTACATCATCCCAGCACTAGCTCACATCTTCACCTTCCGCTC
ATCCGCCGCACGTGAAAACGCTGTGGAGCAGCCACCAAGGTTTCTAGGACGATGGACTGGGGCATTCACG
ATCAATGCCTTCATAGTTGTGTGGGTGTTCATTGTTGGATTCGGGTTCGGTGGTTGGGCTAGTATGATCA
ATTTTGTACACCAGATTGACACCTTTGGCCTCTTCACCAAATGCTACCAATGCCCACCACCGGTTATGGT
CTCACCTCCTCCAATCAGCCATCCTCACTTCAATCACACTCACGGCCTTTGAATTTTAAGCATCAAATCT
TTAGTACCGGCCCAAAAGTGTCCCTAGCTGCCACGAAAGTTGGCGTTTTTCCACGGATCAGAAAAAGACG
A
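The reading-frame question can also be approached without an online tool by scanning all six frames for the longest stretch that runs from an ATG to a stop codon. The following is a minimal sketch of that idea, not the tool used for the answers above. The function names are illustrative, it assumes an uppercase or lowercase DNA string containing only A, C, G and T, and it is demonstrated on a short made-up sequence rather than on the anonymous sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def longest_orf(seq):
    """Find the longest ATG..stop ORF over all six reading frames.

    Returns (strand, frame, start, end, n_codons), where start and end are
    0-based positions on the scanned strand, end is just past the stop codon,
    and n_codons excludes the stop. Returns None if no complete ORF exists.
    """
    seq = seq.upper()
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3    # amino acids encoded
                    if best is None or n_codons > best[4]:
                        best = (strand, frame, start, i + 3, n_codons)
                    start = None                   # look for the next ORF
    return best

# Toy example: ATG AAA TTT GGG TAA sits in the third frame of the forward strand.
print(longest_orf("CCATGAAATTTGGGTAACC"))
```

Anything upstream of the reported start or downstream of the reported end would be a candidate 5' or 3' non-coding region.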

Question 4. (5 marks)
1. Which of the following three sequences are most similar? 1 and 2 are most similar
2. Do you think they are homologous? Yes
3. How did you determine that? By looking at the percent identity matrix.
4. Does this make sense? Yes
5. Why?
Because they are very similar sequences. You can also see in the phylogenetic tree that the
distance between them is not large.
>1
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQKNITCKNGQS
NCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST
>2
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQKNVLCKNGRT
NCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHFDNSV
>3
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQENVTCKNGRTN
CYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDAYV
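A percent identity matrix is built from pairwise comparisons of aligned rows. As a rough sketch of what each cell contains, the function below counts matching columns between two rows of equal length, skipping any column that has a gap in either row. Alignment tools differ in how they treat gaps and which length they divide by, so this convention is an assumption, and the function name is illustrative. The example uses only the first ten residues of sequences 1 and 2, which happen to line up without gaps, so the printed value is not the full-length result.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of ungapped aligned columns in which the two rows are identical."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue              # skip columns with a gap in either row
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# First ten residues of sequences 1 and 2 only (illustration, not the full result).
print(percent_identity("KESPAMKFER", "RESPAMKFQR"))
```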

Part C
MSA (6 Marks)
Question 5. (3 marks)
Perform a multiple sequence alignment with the following sequences (questions are after the
list of sequences).
>SeqA
MSKKSFAKKVICTSMIAIQCAAVVPHVQAYALTNLEEGGYANHNNASSIKIFGYEDNEDLKAKIIQDPEF
IRNWANVAHSLGFGWCGGTANPNVGQGFEFKREVGAGGKVSYLLSARYNPNDPYASGYRAKDRLSMKISN
VRFVIDNDSIKLGTPKVKKLAPLNSASFDLINESKTESKLSKTFNYTTSKTVSKTDNFKFGEKIGVKTSF
KVGLEAIADSKVETSFEFNAEQGWSNTNSTTETKQESTTYTATVSPQTKKRLFLDVLGSQIDIPYEGKIY
MEYDIELMGFLRYTGNAREDHTEDRPTVKLKFGKNGMSAEEHLKDLYSHKNINGYSEWDWKWVDEKFGYL
FKNSYDALTSRKLGGIIKGSFTNINGTKIVIREGKEIPLPDKKRRGKRSVDSLDARLQNEGIRIENIETQ
DVPGFRLNSITYNDKKLILINNI
>SeqB
MISINRSLLATAVLSVLSTGVNAKVYPDQIVFDQLGEDVCRSGYRPLDRYEAEEQKSALVARMGTWQITG
LKGNWVIMGPGYHGEIKQSNSGSTFCYPNNDQSEIPNYSAKAVTEGSEIDVEYDLVNNRNDFVRPLSYLA
HNLGYAWVGGNNSQYVGEDMTIKRSGDSWVIQGNNSGSCDGYRCNEKTKITVDNFTYTVNDNNFWHGDVV
ESDRELVKTVYATARNRSDIAQQVVVDLKVDESTNWSKTNSYGFSESVQTENTFKWPLVGETKLTIKLEA
NQSFAETNGNSTSEQVTLQARPMVPANSELPIRVELYRSTISYPYRFNADISYDVEFNGFLRWGGNAWHS
HPDNRPYKAHTFTMGRSSNESADIRYQWDHRYIPGETKWWDWGWAIKEAGLSSMQYATGGSLRPFHSYVS
GDFNAESQFAGTIEIGQATPITNSVRSKRSVDSVNETTERIGDIEVTTNFNADELSDLGFEGAEMNISVV
E
>SeqC
AEPVYPDQLRLFSLGQEVCGDKYRPVTREEAQSVKSNIINMMGQWQISGLANGWVIMGPGYNGEIKPGSA
SNTWCYPVNPVTGEIPTLSALDIPDGDEVDVQWRLVHDSANFIKPTSYLAHYLGYAWVGGNHSQYVGEDM
DVTRDGDGWVIRGNNDGGCEGYRCGEKTAIKVSNFAYNLDPDSFKHGDVTQSDRQLVKTVVGWAINDSDT
PQSGYDVTLRYDTATNWSKTNTYDLSEKVTTKNKFKWPLVGETELSIEIAANQSWASQNGGSTTTSLSQS
VRPTVPARSKIPVKIELYKADISYPYEFKADVSYDLTLSGFLRWGGNAWYTHPDNRPNWNHTFVIGPYKD
KASSIRYQWDKRYIPGEVKWWDWNWTIQQNGLSTMQNNLARVLRPVRAGITGDFSAESQFAGNIEIGAPV
PLAADSKVRRTRSVDGAGQGLRLEIPLDAQELSGLGFSNVSLSVTPAANQ
>SeqD
MQKLKITGLSLIISGLLIAQAHAAEPVYPDQLRLFSLGQEVCGDKYRPITREEAQSVKSNIVNMMGQWQI
SGLANGWVIMGPGYNGEIKPGSASNTWCYPVNPVTGEIPTLSALDIPDGDEVDVQWRLVHDSANFIKPTS
YLAHYLGYAWVGGNHSQYVGEDMDVTRDGDGWVIRGNNDGGCEGYRCGEKTSIKVSNFAYNLDPDSFKHG
DVTQSDRQLVKTVVGWAINDSDTPQSGYDVTLRYDTATNWSKTNTYGLSEKVTTKNKFKWPLVGETELSI
EIAANQSWASQNGGSTTTSLSQSVRPTVPARSKIPVKIELYKADISYPYEFKADVSYDLTLSGFLRWGGN
AWYTHPDNRPNWNHTFVIGPYKDKASSIRYQWDKRYIPGEVKWWDWNWTIQQNGLSTMQNNLARVLRPVR
AGITGDFSAESQFAGNIEIGAPVPLAADGKAPRALSARRGEQGLRLEIPLDAQELSGLGFSNVSLSVTPA
ANQ
>SeqE
MRTKSSLSILTLSCLTALGTVSLLAEAAIPNTIPKLLMVEHAAPDLDSLKNDAINDPSFISSLFSLGHHL
GYAWAGGTASQYVGEDIEVRRESSNEYSLKARYNGNDPYASGYRANERLKVNLQNIHFVTNPQNLQLGSP
QVYDREAIYTAPVVIYNWGDTEDTGVATLNYDYTTSWAKTDNFSFSEKIGVTNKYEVGIPGIGGASSEIS
AEFSASQGWSETDGKSTTISSQAQYRAIMPPRSKRYISITLFKQKADVPYTSSLYMMYDIKYENFLKWGG
NAHINHPTNRPNFPYTFGGANANNLNGPEALVDQYLHQDINGYGVWDWPAAINSAQSKSGFEWQLANIVR
QHGAPISGKFTAIDSSQFNIDASESYPLTDEDIANRPKSAQKLGLSHNIQVAVGEFQNNDEDGLIKKLNI
SQSSEVMLNN

1. Paste your MSA into this document. Reduce the font size to 8 and change the font to
Courier so that it formats correctly.
CLUSTAL O(1.2.1) multiple sequence alignment

SeqB      --MISINRSLLATAVLSVLSTGVNAKVYPDQIVFDQLGEDVCRSGYRPLDRYEAEEQKSA
SeqC      -----------------------AEPVYPDQLRLFSLGQEVCGDKYRPVTREEAQSVKSN
SeqD      MQKLKITGLSLIISGLLIAQAHAAEPVYPDQLRLFSLGQEVCGDKYRPITREEAQSVKSN
SeqA      MSKKSFAKKVICTSMIA----------------------IQC------------------
SeqE      MR------TKSSLSILT----------------------LSC------------------

SeqB      LVARMGTWQITGLKGNWVIMGPGYHGEIKQSNSGSTFCYPNND-QSEIPNYSAKAVTEGS
SeqC      IINMMGQWQISGLANGWVIMGPGYNGEIKPGSASNTWCYPVNPVTGEIPTLSALDIPDGD
SeqD      IVNMMGQWQISGLANGWVIMGPGYNGEIKPGSASNTWCYPVNPVTGEIPTLSALDIPDGD
SeqA      -AAVVPHVQAYALTN---LEEG---------------GYA-----NHNNASSIKIFGYED
SeqE      -LTALGTVSLLA--------EA---------------AIP-----NTIPKLLMVEHAAPD

SeqB      EIDVEYDLVNNRNDFVRPLSYLAHNLGYAWVGGNNSQYVGEDMTIKRSGDS-----WVI-
SeqC      EVDVQWRLVHDSANFIKPTSYLAHYLGYAWVGGNHSQYVGEDMDVTRDGDG-----WVI-
SeqD      EVDVQWRLVHDSANFIKPTSYLAHYLGYAWVGGNHSQYVGEDMDVTRDGDG-----WVI-
SeqA      NEDLKAK-IIQDPEFIRNWANVAHSLGFGWCGGTANPNVGQGFEFKREVGAGGKVSYLLS
SeqE      LDSLKND-AINDPSFISSLFSLGHHLGYAWAGGTASQYVGEDIEVRRES----SNEYSLK

SeqB      --QGNNSGSCDGYRCNEKTKITVDNFTYTVNDNNFWHGDVVESDRELVKTVYATARNRSD
SeqC      --RGNNDGGCEGYRCGEKTAIKVSNFAYNLDPDSFKHGDVTQSDRQLVKTVVGWAINDSD
SeqD      --RGNNDGGCEGYRCGEKTSIKVSNFAYNLDPDSFKHGDVTQSDRQLVKTVVGWAINDSD
SeqA      ARYNPNDPYASGYRAKDRLSMKISNVRFVIDNDSIKLGTPKVKKLAPLNSASFDLINESK
SeqE      ARYNGNDPYASGYRANERLKVNLQNIHFVTNPQNLQLGSPQVYDREAIYTAPVVIYNWGD

SeqB      IAQQ-VVVDLKVDESTNWSKTNSYGFSESVQTENTFKWPLVGE----TKLTIKLEANQSF
SeqC      TPQSGYDVTLRYDTATNWSKTNTYDLSEKVTTKNKFKWPLVGE----TELSIEIAANQSW
SeqD      TPQSGYDVTLRYDTATNWSKTNTYGLSEKVTTKNKFKWPLVGE----TELSIEIAANQSW
SeqA      TESK-LSKTFNYTTSKTVSKTDNFKFGEKIGVKTSFKVGLEAIADSKVETSFEFNAEQGW
SeqE      TEDT-GVATLNYDYTTSWAKTDNFSFSEKIGVTNKYEVGIPGIGGASSEISAEFSASQGW

SeqB      AETNGNSTSEQVTLQARPMVPANSELPIRVELYRSTISYPYRFNADISYDVEFNGFLRWG
SeqC      ASQNGGSTTTSLSQSVRPTVPARSKIPVKIELYKADISYPYEFKADVSYDLTLSGFLRWG
SeqD      ASQNGGSTTTSLSQSVRPTVPARSKIPVKIELYKADISYPYEFKADVSYDLTLSGFLRWG
SeqA      SNTNSTTETKQESTTYTATVSPQTKKRLFLDVLGSQIDIPYEGKIYMEYDIELMGFLRYT
SeqE      SETDGKSTTISSQAQYRAIMPPRSKRYISITLFKQKADVPYTSSLYMMYDIKYENFLKWG

SeqB      GNAWHSHPDNRPYKAHTFTMGRSSN--ESADIRYQWDHRYIPGETKWWDWGWAIKEA---
SeqC      GNAWYTHPDNRPNWNHTFVIGPYKD--KASSIRYQWDKRYIPGEVKWWDWNWTIQQN---
SeqD      GNAWYTHPDNRPNWNHTFVIGPYKD--KASSIRYQWDKRYIPGEVKWWDWNWTIQQN---
SeqA      GNAREDHTEDRPTVKLKFGKN---GMSAEEHLKDLYSHKNINGY-SEWDWKWVDEK----
SeqE      GNAHINHPTNRPNFPYTFGGANANNLNGPEALVDQYLHQDINGY-GVWDWPAAINSAQSK

SeqB      -GLSSMQYATGGSLRPFHSYVSGDFNAESQFAGTIEIGQATPITNSVRSKRSVDSVNETT
SeqC      -GLSTMQNNLARVLRPVRAGITGDFSAESQFAGNIEIGAPVPLAADSKVRRTRSVDG---
SeqD      -GLSTMQNNLARVLRPVRAGITGDFSAESQFAGNIEIGAPVPLAADGKAPRALSARR---
SeqA      FGYLFKNSYDALTSRKLGGIIKGSFTNINGTKIVIREGKEIPLPDKKRRGKRSVDSLDAR
SeqE      SGFEWQL---ANIVRQHGAPISGKFTAIDSSQFNIDASESYPLTDEDIANRPKSAQKLGL

SeqB      ERIGDIEVTTNFNADELSDLGFEGAEMNISVVE-------
SeqC      -AGQGLRLEIPLDAQELSGLGFSNVSLSVTPAANQ-----
SeqD      -GEQGLRLEIPLDAQELSGLGFSNVSLSVTPAANQ-----
SeqA      LQNEGIRIE-NIETQDVPGFRL--NSITYNDKKLILINNI
SeqE      SHNIQVAVG-EFQNNDEDGLIK---KLNISQSSEVMLNN-

2. Which two sequences have the most sequence similarity? Which two have the least? C
and D are most similar. The least similar are A and B.
3. From the phylogenetic tree, calculate the distance from SeqA to SeqB. Show the formula
you used. Is this greater or less than the distance of SeqA to SeqC?
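The distance between two sequences is the sum of the branch lengths along the path that
joins them in the tree.
SeqA to SeqB: 0.37242 + 0.16297 + 0.26068 = 0.79607
SeqA to SeqC: 0.37242 + 0.16297 + 0.20697 + 0.01342 = 0.75578
The distance from SeqA to SeqC is less than the distance from SeqA to SeqB.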
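The tree distances above are sums of branch lengths along the path between two tips, and the arithmetic can be checked with a few lines of code. This sketch takes the branch lengths exactly as quoted in the answer; the variable and function names are illustrative, and which branch belongs to which node of the tree is not stated in the answer, so it is not assumed here.

```python
def path_distance(branch_lengths):
    """Tree distance between two tips: sum of the branch lengths on their path."""
    return round(sum(branch_lengths), 5)

# Branch lengths as quoted in the answer for each path.
seq_a_to_seq_b = [0.37242, 0.16297, 0.26068]
seq_a_to_seq_c = [0.37242, 0.16297, 0.20697, 0.01342]

d_ab = path_distance(seq_a_to_seq_b)
d_ac = path_distance(seq_a_to_seq_c)

print("SeqA to SeqB:", d_ab)
print("SeqA to SeqC:", d_ac)
print("SeqA is closer to SeqC" if d_ac < d_ab else "SeqA is closer to SeqB")
```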

Question 6. (3 marks)
Using text query, search the protein sequence database for ribosomal protein S16. Download
entries 3, 4 and 5 in FASTA format, paste into your document.
>3
1 mavklklkrm gkirtpqyri ivadartkrd gkaieeigky hpkehpslie vdservqywl
61 svgaqpsdav kvllrksgdw qkfkglpapa plqvaeakdp eakeaafqaa lkelvlpgae
121 gkgkgrksek kadkadkpae taetteteks egea

>4
1 malrirlarf grshhpiyri vvmdakspre gkyvdilgty dpirkdlqvk eekvkdwlsk
61 gaeitdraks llrsakil

>5
1 mairirlmrk grrnrpfyrv vaaeasaprd gkflevlgyy dplkdpyefk vdpekvkkwl
61 drgakptetv rallrrsgil n
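The entries above are pasted in the numbered, space-separated layout of a GenBank/GenPept sequence listing rather than in FASTA, and alignment tools may not accept the position numbers. A minimal sketch of a conversion is shown below: it drops everything that is not a letter, upper-cases the residues and wraps them under a header line. The function name and the 60-residue line width are choices made for this illustration.

```python
import re

def origin_to_fasta(name, origin_text, width=60):
    """Convert a numbered, space-separated sequence listing to FASTA text."""
    residues = re.sub(r"[^A-Za-z]", "", origin_text).upper()  # drop digits and spaces
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + name + "\n" + "\n".join(lines)

# Entry 4 exactly as pasted above.
entry_4 = """
1 malrirlarf grshhpiyri vvmdakspre gkyvdilgty dpirkdlqvk eekvkdwlsk
61 gaeitdraks llrsakil
"""

print(origin_to_fasta("4", entry_4))
```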

Run a multiple sequence alignment.


1. Paste the phylogenetic tree (real) here. Use a capture program (Mac: Grab, or Cmd + Shift +
3; PC: Snipping Tool, or Ctrl + Print Screen).

2. Do you think all these sequences are homologous?


Yes, I think they are. A BLASTp search with the sequences that showed the least similarity
suggests that these sequences could be homologous.
3. Which sequence has an insertion?
Sequence 3 has an insertion