
Assignment 2 Report

Advanced Database System


Introduction
HTML (HyperText Markup Language) pages on the web are often treated as a good source of knowledge. Nowadays, if a person does not know a specific term or fact, a very common solution is to Google it. There are billions of web pages on many different subjects, and the number is still increasing rapidly [1]. HTML was designed to be read by humans, not by programs. To let computer programs process those data, one idea is to store the information held in HTML files in a database. This is not easy, mainly because human-readable content is usually in free format. The objective of this project is to write a simple wrapper that creates a database with tables from an XML (Extensible Markup Language) schema, extracts specific useful information from HTML files according to the instructions in XML files, and then stores it in the SQL database.

Principles and potential issues


Storing web data in a database is not a new idea, but it is not very common because there are many problems to solve. There are two obvious major areas to overcome: one is the compatibility between SQL and XML schema, the other is the mismatch between SQL and HTML.

XML schema to SQL issues

XML files are properly nested and designed with simplicity in mind, but some problems still need to be solved. XML is very flexible and its structure depends on the XML designer. There are virtually unlimited ways to express the same data; for example, data can be written in elements or as attributes. However, this flexibility is also a drawback. Since XML allows creators to invent their own tags and their own style, the XML designer and the program designer must agree on a structure before different XML files can be processed correctly. In other words, the structure must be known in advance.

When an XML schema expresses a database structure with tables, there are some hidden problems in implementing it in an SQL database. For example, in XML it is possible to define a foreign key that references a table declared further down the file. In SQL, if the referenced column has not been defined yet, the statement fails. Moreover, databases, tables and columns can all be expressed in the same way, e.g. <xs:element name=???>, so some identifier must be used to distinguish the role of each <xs:element name=???>.
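One way to avoid the foreign key ordering problem is to sort the tables by their dependencies before issuing any CREATE TABLE statement. The sketch below is only an illustration of that idea: it is written in Python rather than the C# used in the project, the table names are hypothetical, and it assumes the foreign key references have already been read from the schema into a dictionary.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def creation_order(foreign_keys):
    """Return table names ordered so that every referenced table comes first.

    foreign_keys maps each table name to the set of tables it references.
    graphlib raises CycleError if the references form a cycle.
    """
    return list(TopologicalSorter(foreign_keys).static_order())


# Hypothetical schema: "enrolment" is declared first in the XML,
# but it references two tables that are declared after it.
tables = {
    "enrolment": {"student", "course"},
    "student": set(),
    "course": set(),
}

order = creation_order(tables)
for name in order:
    print("CREATE TABLE", name)
```

With this ordering, "student" and "course" are always created before "enrolment", whatever order they appear in inside the schema file.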

HTML to SQL issues and possible solution

A web page may contain tables, text with or without descriptive tags, pictures, and even other elements such as movies played through plug-ins. The position of the data is not fixed and can change: a picture at the top can be moved to the bottom, and a table can appear before or after the text. All of this depends on the web designer, and every such layout is valid HTML. If the structure of a web page is volatile, there is almost no fully automatic way to extract its data unless some fixed rules are given.

Table

Tables in web pages are usually easier to identify, since the table tag, <table></table>, is a clear identifier. In addition, table header cells are similar to the column names of an SQL table, while the standard cells act like the data values. It is also not hard to find them by searching for the tags <th></th> and <td></td> respectively. However, there are still some problems. For example, the displayed header cells may not reflect the actual column names in the database. The following scenario explains this further.

Scenario: store the Student ID and Score in Figure 1.1 (HTML) in the database table in Figure 1.2.

Web Table

Class  Student ID  Student Name  Score
A      1000001     John          80
A      1000002     Peter         60
B      1000003     Mary          75

Figure 1.1

SQL Table

StdId    Mark
1000001  80
1000002  60
1000003  75

Figure 1.2

Problem:
1. The correct header cells must be mapped to the SQL table columns (i.e. Student ID -> StdId and Score -> Mark).
2. Unused data must be thrown away (i.e. Class and Student Name in the web table).

Possible solutions: If it is a routine job and the structure will not change, it is possible to hard-code the program; however, this is obviously not the best solution. A better approach is to teach the program how to retrieve the data with some rules, possibly held in external XML files. This is explained further below.

Text

Storing all of the text in a database would not cause a big problem, but retrieving only part of the data from the text is a challenge. It becomes possible if the HTML file is well developed and contains useful descriptions and identifiers. A good example is searching for specific keywords inside the definition list tags <dl>, <dt> and <dd>. However, the descriptions must be well designed, or the structure must always remain unchanged; otherwise the program has no way to achieve the goal correctly.

Other problems

Unlike XML, HTML is not always properly nested, and some elements have no end tag. A very common example is the line break, which can be written as <br> or <br/> [2]. Both forms are accepted and are used in many web pages. Moreover, the same result can be expressed in different ways, which makes the task more difficult; for example, table cells can be written on the same line or on separate lines. Although this is annoying, it is not hard to handle. A preprocessing step can split a single line into multiple lines or, vice versa, collect table cells spread over multiple lines into one line.

Sample code:
<tr>
<td>Cell 1</td><td>Cell 2</td><td>Cell 3</td>
</tr>

Equivalent code:
<tr>
<td>Cell 1</td>
<td>Cell 2</td>
<td>Cell 3</td>
</tr>
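The preprocessing step described above can be sketched with two regular expressions. This is only an illustration in Python (the project itself used C#), the function name is hypothetical, and it assumes simple table markup without nested tables or ">" characters inside the cell text.

```python
import re


def one_tag_per_line(html):
    """Put every <tr>, </tr> and table cell on its own line."""
    # First remove any whitespace that sits between two tags.
    compact = re.sub(r">\s+<", "><", html.strip())
    # Then start a new line before each row tag and each opening cell tag.
    return re.sub(r"(?i)(?<=>)(?=<(?:/?tr|t[dh])\b)", "\n", compact)


single_line = "<tr> <td>Cell 1</td><td>Cell 2</td><td>Cell 3</td> </tr>"
multi_line = """<tr>
<td>Cell 1</td>
<td>Cell 2</td>
<td>Cell 3</td>
</tr>"""

normalized = one_tag_per_line(single_line)
print(normalized)
```

Both layouts of the sample row produce the same normalized text, so the later extraction code only has to deal with one form.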


Program Design

Basic Concept


There is no standard or best method for wrapping data from HTML. In this project, with the help of rules stored in XML, it is possible to locate specific information in different parts of a web page with a predefined format. The wrapper works in two phases. The first phase establishes the database from the XML schema; it is the prerequisite of the second phase, which is the core of the wrapper. The second phase reads through the XML files to obtain the rules, identifies the target data according to those rules, and stores the data in the SQL database.

The two parts of our project:
1. Create a database with corresponding tables based on the XML schema structure.
2. Read the XML files to specify the needed data in the HTML.

Figure 1.3 The Work Flow of the Wrapper.

Data Extraction Principle in wrapper


The most challenging part of this project is to turn the unstructured data in HTML into structured data. An SQL database requires a strict data format and length for every table column. HTML files, on the other hand, are designed to be read by humans: the length and format of a description can vary. Also, not every line of the HTML is useful. Headlines, colors, font sizes, CSS and so on make the page more comfortable to read and improve its appearance, but they are neither significant nor valuable for the database.

As mentioned above, the easiest case is the table in a web page. Although an HTML table is similar to an SQL table, any content in it that does not belong in the SQL table has to be kept out according to the XML instruction. The XML defines all fields needed in SQL, so by comparing the information from the XML with the HTML table, it is not hard to pick out the needed content and store it in an array.

SQL Database Table A

Clm1   Clm2   Clm3
Data1  Data2  Data3

XML Instruction

SQL   Clm1     Clm2     Clm3
HTML  Column1  Column2  Column3

HTML Table content

Column1  Column2  Column3  Column4
Data1    Data2    Data3    Data4

Column4 is ignored because it has no match in the XML instruction.

Similarly, we can apply the technique to the Id tag of the HTML. However, some unwanted information needs to be deleted, such as the "Amount:" in "Amount: 10000". As it is specified in the XML as the identifier, it is not hard to remove the unwanted information by replacing "Amount:" with an empty string before writing the value to the SQL database.
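The column matching and label removal described above can be sketched as follows. This is a Python illustration, not the project's C# code. The names TableReader, extract, strip_label and COL_MATCH are hypothetical, the sample page reuses the student scenario from Figure 1.1, and the mapping is written as a plain dictionary rather than read from an XML instruction file. The sketch assumes a single, rectangular table with header cells in <th> tags.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collect the header cells (<th>) and data rows (<td>) of one simple table."""

    def __init__(self):
        super().__init__()
        self.headers = []   # text of every <th> cell
        self.rows = []      # one list of <td> texts per data row
        self._row = []      # cells of the row being read
        self._cell = None   # text pieces of the open cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell).strip()
            if tag == "th":
                self.headers.append(text)
            else:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []


def extract(html, col_match):
    """Keep only the HTML columns named in col_match, renamed to SQL columns."""
    reader = TableReader()
    reader.feed(html)
    wanted = [(i, col_match[h]) for i, h in enumerate(reader.headers) if h in col_match]
    return [{sql: row[i] for i, sql in wanted} for row in reader.rows]


def strip_label(text, label):
    """Replace the first occurrence of a label such as "Amount:" with nothing."""
    return text.replace(label, "", 1).strip()


# Assumed mapping: HTML header -> SQL column (unlisted columns are ignored).
COL_MATCH = {"Student ID": "StdId", "Score": "Mark"}
PAGE = """<table>
<tr><th>Class</th><th>Student ID</th><th>Student Name</th><th>Score</th></tr>
<tr><td>A</td><td>1000001</td><td>John</td><td>80</td></tr>
<tr><td>B</td><td>1000003</td><td>Mary</td><td>75</td></tr>
</table>"""

rows = extract(PAGE, COL_MATCH)
amount = strip_label("Amount: 10000", "Amount:")
```

Each entry of rows is a dictionary keyed by SQL column name, so it can be passed to a parameterized INSERT statement. Columns without a match, such as Class and Student Name, never reach the database.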

Features of the Wrapper


- Creates the database and its corresponding tables according to the instructions in the .XSD files.
- Reads the XML files to get the rules and follows them to obtain useful information from the HTML files, then enters the data into the database.
- Handles different kinds of data format in HTML, including tables and specific text, as stated by the XML.
- Handles HTML tables with unwanted columns.
- Provides a graphical user interface that is user friendly and easy to use.
- Displays the SQL statements used, for manual checking if necessary.
- Returns feedback when the program has finished running.
- Checks for some errors in the XML files before running, to ensure data integrity, including missing file tags or missing essential files, e.g. <htmlFile>, </htmlFile>, <colMatch>, <byStr> etc.
- Shows an error message that points out the problem in the XML so that the user can check that area more quickly.
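The pre-run check of the XML files could look roughly like the sketch below. It is written in Python for illustration only (the project used C#). The tag names come from the examples listed above, but treating all three as mandatory in every instruction file, and the root element name "rules", are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Tag names quoted in the feature list; assumed here to be mandatory.
REQUIRED_TAGS = ("htmlFile", "colMatch", "byStr")


def check_instruction(xml_text):
    """Return a list of readable problems; an empty list means none were found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Covers unclosed or mismatched tags such as a missing </htmlFile>.
        return [f"XML is not well formed: {err}"]
    present = {element.tag for element in root.iter()}
    return [f"Missing <{tag}> element" for tag in REQUIRED_TAGS if tag not in present]


good = "<rules><htmlFile>a.htm</htmlFile><colMatch/><byStr/></rules>"
no_by_str = "<rules><htmlFile>a.htm</htmlFile><colMatch/></rules>"
unclosed = "<rules><htmlFile>a.htm</rules>"

for sample in (good, no_by_str, unclosed):
    print(check_instruction(sample))
```

Returning a list of messages rather than stopping at the first problem lets the user interface show every issue found in one pass.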

Limitations and improvements


Unfortunately, due to limited time and resources, this version of the wrapper is not perfectly designed and has some limitations.

Limitation: If the same .htm file is executed twice, the program crashes because of duplicated values in the primary key field.
Improvement: The program could search the records in the database first and, if the record already exists, throw an exception to avoid the crash.

Limitation: The program automatically loads all rows of a table into the database and cannot select only some of the rows.
Improvement: The UI (User Interface) of the wrapper could be enhanced to let the user confirm whether all records are needed.

Limitation: The program cannot handle problematic HTML files. If a file is invalid, the program is not able to detect it, so there is a very high chance of a crash.
Improvement: Before executing the core part of the wrapper, use a separate module to validate the HTML file.

Limitation: Because of limited resources, e.g. time, the error checking of the XML files does not report the line on which an error occurs.
Improvement: A line counter could be added so that the error location reported to users is much more precise.
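The first improvement above, checking for an existing record before inserting, might be sketched as follows. The example uses Python with an in-memory SQLite database purely as a stand-in for the project's C# code and SQL database. The names insert_unique and DuplicateRecordError, and the Score table, are hypothetical. Table and column names are assumed to come from the trusted schema, since they are placed directly into the SQL text.

```python
import sqlite3


class DuplicateRecordError(Exception):
    """Raised when a row with the same primary key is already stored."""


def insert_unique(conn, table, key_col, row):
    """Insert row (a dict of column -> value) unless its key already exists."""
    found = conn.execute(
        f"SELECT 1 FROM {table} WHERE {key_col} = ?", (row[key_col],)
    ).fetchone()
    if found:
        raise DuplicateRecordError(f"{key_col}={row[key_col]} already in {table}")
    columns = ", ".join(row)
    marks = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({marks})", tuple(row.values())
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Score (StdId TEXT PRIMARY KEY, Mark INTEGER)")
record = {"StdId": "1000001", "Mark": 80}

insert_unique(conn, "Score", "StdId", record)
try:
    insert_unique(conn, "Score", "StdId", record)  # same file processed again
    second_attempt = "inserted"
except DuplicateRecordError as err:
    second_attempt = "rejected"
    print("Skipped:", err)
```

The caller can catch the exception, report the duplicate to the user, and continue with the next row instead of crashing.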


Reflection
We gained several things from this project. First of all, it was without doubt a difficult project, with many major and minor problems to be solved. In general, we had to break the project down into smaller tasks and resolve the problems one by one. This technique is useful and can, in principle, be applied to any other project or problem-solving task.

On the technical side, we did a lot of research on how to use C# and learnt more ways to write code to accomplish the tasks. We found that our programs are not optimal, or at least not the most efficient, and there is still room for improvement. Nevertheless, knowing where and how to improve is a good direction.

We had no idea about XML before the lectures; it was completely new to us. The lectures introduced the basic concepts and functions, and this programming project has deepened our knowledge of XML. We are now confident that we can create XML documents by ourselves, which is useful for both work and study.

Up to this moment, this project has been the most difficult and challenging one compared with the projects of the past semester. It involved multiple areas that we had to become familiar with to finish the tasks. We thought it was mission impossible, but through hard work it became possible. Knowledge accumulates, and there are common points among the different areas. The structure of XML is a good example: it was very new to us, but when we studied it more deeply we found it quite similar to HTML. Completing the project was a great achievement for us.

Last but not least, the project was a good opportunity to form a study group, work towards the same goals and solve problems one by one. It was a challenging task but undoubtedly a very practical way to apply what we have learnt. Every team member was able to share their knowledge and contributed a lot.

It could also be a very good start for a bigger project in the future, which might build on this wrapper to produce a more advanced version with better functions and support for more kinds of HTML and XML formats.


Conclusion
It is possible and practical to retrieve data from HTML into a database under some constraints. There are minor issues that have to be taken care of. The project is successful, but there are still some areas that could be improved.

References
[1] U.S. Census Bureau. Internet Access and Usage: 2010. Internet: http://www.census.gov/compendia/statab/2012/tables/12s1157.xls, 2012 [Aug. 10, 2012]
[2] Refsnes Data. HTML <br> Tag. Internet: http://www.w3schools.com/tags/tag_br.asp, 2012 [Aug. 12, 2012]

