Laurence Herbert

Web Based Bioinformatics Data Management System

Prototyping, Analysis of Software & Implementation Decisions

With a web-based project, the programming languages, tools and platforms available
need to be analysed to decide which best fulfil the project's requirements. Many
alternative database systems and programming languages exist, and the most relevant
and efficient ones need to be chosen.

The objective of the project is to implement an online protein management system,
which will require a front-end for user interaction, a middle level to process data, and
a backend to store the processed data. This document discusses which languages and
tools could be used at each level, together with their possible advantages and
drawbacks.

Front End

The front-end of the system is where the user interacts with the website. As with all
websites, the structure of each page is written in HTML, a markup language that uses
a series of tags which the browser renders. Other languages that work alongside
HTML include Cascading Style Sheets (CSS) and JavaScript. When the user visits the
front-end (the website), a series of text boxes and buttons will be displayed so that a
PDB file can be uploaded or a list of protein chains entered manually. To achieve this,
the three languages mentioned above (HTML, CSS and JavaScript) have been
chosen. The role of each is described below:

• HTML/XHTML – As noted above, this language defines the content of each page
and provides the text boxes, buttons and text.

• CSS – Used in conjunction with HTML to describe the presentation of the
website, such as layout, table styling and fonts.

• JavaScript – Enables loading bars and other interactive parts of the website.
For example, JavaScript will be used to display a loading bar so that the user
can see the progress of the PDB file being uploaded to the server and the
time taken to perform calculations on the proteins.

These three languages were chosen because they are the standard languages
supported by web browsers, which leaves no practical alternatives. Furthermore,
members of our group have past experience with each of them.

Data Processing (middle level)

Once the user has entered the protein information it needs to be processed. To do
this a number of programming languages have been chosen:

• PHP – A server-side scripting language for the web. It is widely used across
the Internet for dynamic websites, data processing and database
management.

There are several major advantages that PHP has over other languages such as
ASP or Perl.

Advantages of PHP

• PHP is open source and has a large community, so coding support is readily
available and the documentation is extensive and well written.
• PHP runs as an Apache module and is itself implemented in C, so it is
generally fast.
• It interfaces with databases such as MySQL, Oracle and MS SQL.
• It can be embedded directly in HTML.
• Ease of understanding and prior knowledge: members of the group already
have past experience with PHP.
• It runs on multiple platforms, such as UNIX, Windows and Mac.
• Server support: PHP is already installed on the server on which the project is
being developed.

There are certain disadvantages, such as the language being loosely typed (the type
of a variable does not have to be stated explicitly), which can make code harder to
debug. Given the advantages above and the group's prior knowledge, it was decided
that PHP would be chosen to handle data processing.

Perl is also used, to perform calculations on the proteins. It computes properties of a
protein such as the Secondary Structure, Solvent Accessibility and others. The scripts
for this have already been provided to us and are written in Perl. PHP can execute
these scripts and receive their results for processing before anything is sent back to
the browser. Calling out to Perl may reduce the efficiency of code execution, but the
scripts provided are complex and rewriting them in a language such as PHP would be
outside the scope of the project.
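
The pattern of running an external script and collecting its output can be sketched as
follows. This is a minimal illustration only: it is written in Python rather than PHP so
that it is self-contained, and the command, the `run_script` helper and the output
format are all assumptions rather than the project's actual code. A Python one-liner
stands in for the Perl script so the sketch runs without the provided scripts.

```python
import subprocess
import sys


def run_script(command, timeout_seconds=60):
    """Run an external command and return its standard output as a list of lines."""
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        timeout=timeout_seconds,
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout.splitlines()


# Stand-in for something like ["perl", "secondary_structure.pl", "1abc.pdb"];
# that script name and its arguments are hypothetical.
demo_command = [sys.executable, "-c", "print('A H'); print('G E')"]
lines = run_script(demo_command)
```

Passing the command as a list of arguments, rather than a single shell string, avoids
shell interpretation of user-supplied values such as file names.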

Backend

Once the calculations have been performed on the protein, the results need to be
stored in a database. We have decided to use MySQL to do this.

Advantages of MySQL

• Ease of integration with PHP.
• Open source and free, with good documentation.
• Management tools such as phpMyAdmin make the database easy to manage.
• Efficient and fast; it is used as the backend database of many large websites.
• The ability to “dump” the database, i.e. create a backup. This generates a
file containing all the data in the database together with the SQL that
defines its structure. A script could be used to dump the database at a set
time or interval, for example every 24 hours.
• Prior knowledge and use by group members.
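
To illustrate what a dump contains, the sketch below produces one from a small
database and then restores it. It is an assumption-laden stand-in: it uses SQLite
through Python's built-in `sqlite3` module so that it runs without a server, and the
`protein` table and its columns are hypothetical. For MySQL itself the usual tool is
the `mysqldump` command-line utility, which could be scheduled to run periodically.

```python
import sqlite3


def dump_database(connection):
    """Return the SQL text needed to recreate the schema and data."""
    return "\n".join(connection.iterdump())


# Hypothetical table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (pdb_id TEXT PRIMARY KEY, chain TEXT)")
conn.execute("INSERT INTO protein VALUES ('1abc', 'A')")
conn.commit()

backup_sql = dump_database(conn)

# Restoring: replay the dumped SQL against an empty database.
restored = sqlite3.connect(":memory:")
restored.executescript(backup_sql)
```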

An alternative to MySQL is HDF5, a file format and library for storing large amounts
of numerical data. HDF5 has API support for languages such as C, C++ and Java but,
as far as our research found, no support for or integration with PHP. Although HDF5
appears well suited to storing large amounts of data, it seems to have limited support
for web languages such as PHP. Furthermore, no group members have experience of
using this library. It could be worthwhile doing more research into HDF5, but for initial
prototyping MySQL has been chosen.

Prototyping

Prototype 1.5

Some initial prototyping has been performed using the languages chosen above. The
user can either upload a PDB file or provide a list of proteins to process. If the user
chooses to upload a PDB file, PHP handles uploading the file to a specific folder on
the server.

The user then selects what information they wish to derive about the protein. PHP
executes the Perl script required for that calculation; the Perl script performs the
calculation and the result is returned to PHP as an array. PHP then splits the data.
For example, for the Secondary Structure, PHP loops through the array and splits
each entry into the Amino Acid type and the Secondary Structure state.
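
The splitting step can be sketched as below. This is only an illustration under stated
assumptions: it is written in Python rather than PHP, and it assumes each output line
holds a one-letter amino acid code and a secondary structure state separated by
whitespace. The real format of the Perl scripts' output is not described here, and the
function name is hypothetical.

```python
def split_secondary_structure(lines):
    """Split 'amino_acid state' lines into (position, amino_acid, state) tuples."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        amino_acid, state = fields
        position = len(records) + 1  # 1-based index over the valid residues only
        records.append((position, amino_acid, state))
    return records


# Made-up sample in the assumed format, including one blank line.
sample_output = ["M C", "K H", "", "V E"]
records = split_secondary_structure(sample_output)
```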

Once the data has been split it can be added into the database. A MySQL “insert”
statement is used to insert the data into the database.
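
A minimal sketch of the insert step follows, assuming a hypothetical
`secondary_structure` table and using SQLite through Python's `sqlite3` module as a
stand-in for MySQL and PHP. It uses a parameterised statement, so values are passed
separately from the SQL text, and inserts all rows for a protein in one transaction.

```python
import sqlite3

# Hypothetical schema; the project's real table layout is not specified.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE secondary_structure ("
    " pdb_id TEXT, position INTEGER, amino_acid TEXT, state TEXT)"
)


def store_records(connection, pdb_id, records):
    """Insert (position, amino_acid, state) tuples for one protein."""
    rows = [(pdb_id, pos, aa, state) for pos, aa, state in records]
    with connection:  # commits on success, rolls back if an error is raised
        connection.executemany(
            "INSERT INTO secondary_structure VALUES (?, ?, ?, ?)", rows
        )
    return len(rows)


inserted = store_records(conn, "1abc", [(1, "M", "C"), (2, "K", "H")])
```

Grouping the rows into a single transaction usually matters more for speed than the
individual statements do, which is relevant when one protein produces thousands of
rows.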

Prototype 1.7

Modifications were made to the previous prototype to move it towards the use
required by the specification. The first user interface design was added to the front
page, giving the user the choice to either upload a text file containing a list of
proteins, or to manually enter them.

Once the user has taken one of these two steps, the PDB files are copied from their
source to a local folder for use. The user is then presented with either a confirmation
of the copy or a list of missing files, along with the option to select what kind of data
they want to load into the database. Once this is submitted, a function (previously a
separate PHP file) is called and the data is entered into the database.

Our main objective with the first prototypes was to implement a way for the user to
interact with the system, upload a protein and for information about that protein to be
derived and stored inside a database, which has been achieved. There are some
technical issues and additions that need to be addressed with the prototype:

Technical issues

The main issue is efficiency. For example, when calculating the contact map of a
protein, the database table can grow to as many as 40,000 records, and this is for
only one protein. One of the objectives of the system is to be able to analyse multiple
proteins, and if one protein can take a minute or longer to process then the whole
process becomes very slow. There are several optimisation techniques that could be
used:

• Benchmarking queries to see which are running the slowest.
• Use of the “EXPLAIN” statement, which provides information about how the
MySQL optimizer processes queries.
• Use of indexes so that MySQL does not have to scan a table in its entirety.
Adding indexes to the database is expected to speed up processing
considerably.
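
The effect of an index on a query plan can be sketched as follows. The example is a
stand-in rather than the project's setup: it uses SQLite's `EXPLAIN QUERY PLAN`
through Python's `sqlite3` module, whereas MySQL uses `EXPLAIN SELECT ...` and
reports its plan in a different format. The `contact_map` table, its columns, the index
name and the contact rule are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contact_map ("
    " pdb_id TEXT, residue_i INTEGER, residue_j INTEGER, in_contact INTEGER)"
)

# Fill the table with placeholder data for two made-up proteins.
# The 'in contact' rule here is arbitrary and only produces some 0s and 1s.
rows = [
    (pdb_id, i, j, int(abs(i - j) < 4))
    for pdb_id in ("1abc", "2xyz")
    for i in range(50)
    for j in range(50)
]
conn.executemany("INSERT INTO contact_map VALUES (?, ?, ?, ?)", rows)


def query_plan(connection, sql, params=()):
    """Return the planner's description of how it would run the query."""
    plan_rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in plan_rows)  # last column is the detail text


QUERY = (
    "SELECT residue_i, residue_j FROM contact_map"
    " WHERE pdb_id = ? AND in_contact = 1"
)

plan_before = query_plan(conn, QUERY, ("1abc",))  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_contact_pdb ON contact_map (pdb_id)")
plan_after = query_plan(conn, QUERY, ("1abc",))   # the planner can now use the index
```

Comparing `plan_before` with `plan_after` shows whether the index is actually used,
which is worth checking before assuming it will help.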

Furthermore, the PHP code should adopt a Model-View-Controller (MVC) architecture
to make the code more modular and easier to understand and edit. This will be done
in the next prototype.
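
As a rough sketch of the separation MVC provides, the example below keeps data
storage, presentation and request handling apart. It is written in Python rather than
PHP for brevity, and every class, function and identifier in it is hypothetical; it is not
the design of the next prototype.

```python
class ProteinModel:
    """Model: owns the data and knows nothing about presentation."""

    def __init__(self):
        self._chains = {}

    def add_chain(self, pdb_id, chain):
        self._chains.setdefault(pdb_id, []).append(chain)

    def chains_for(self, pdb_id):
        return list(self._chains.get(pdb_id, []))


def render_chain_list(pdb_id, chains):
    """View: turns data into text for the user and holds no application logic."""
    if not chains:
        return f"No chains stored for {pdb_id}"
    return f"{pdb_id}: " + ", ".join(chains)


class ProteinController:
    """Controller: receives a request, updates the model and picks the view."""

    def __init__(self, model, view):
        self.model = model
        self.view = view

    def handle_add(self, pdb_id, chain):
        self.model.add_chain(pdb_id, chain)
        return self.handle_show(pdb_id)

    def handle_show(self, pdb_id):
        return self.view(pdb_id, self.model.chains_for(pdb_id))


controller = ProteinController(ProteinModel(), render_chain_list)
page = controller.handle_add("1abc", "A")
```

Because the view is passed in as a function, the output format could be changed
without touching the model or the controller.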
