
An Effective Wrapper Architecture to Heterogeneous Data Source

Hongzhi Wang, Jianzhong Li, Zhenying He
Department of Computer Science and Technology, Harbin Institute of Technology
whongzhi@0451.com, lijz@banner.hl.cninfo.net, hzy_hit_cn@sina.com

Proceedings of the 17th International Conference on Advanced Information Networking and Applications (AINA'03) 0-7695-1906-7/03 $17.00 © 2003 IEEE

Abstract

In this paper, we focus on the problem, in information integration systems, of obtaining data from heterogeneous data sources accurately and effectively. XML is used as the data exchange format of the wrapper. We design the wrapper architecture around the conversion and management of views, which serve as the bridge from the global schema to the local schemas of the various data sources. Our wrapper has two main subsystems: the data extract subsystem and the query executor subsystem. The former loads data into the mediator's cache when changes exceeding a threshold are detected, and the latter answers queries from the mediator. The architecture adapts to data and schema changes in the data source and can answer the mediator's queries effectively. Because the wrapper may run in an environment outside our control, its processing should be simple enough, its own storage should be as small as possible, and the storage of the data source can be used instead. The details of the query rewrite, view management, query merge, result wrap and schema change detect modules are discussed. The behavior of the wrapper during query processing is discussed with a running example. The security strategy, especially for the case in which the wrapper runs inside an autonomous data source, is also introduced.

1. Introduction

Modern information systems need to query not only local, homogeneous data sources but also heterogeneous data sources in distributed environments, and even unknown data sources hidden in the web. Obtaining information from heterogeneous data sources to answer a query on the global schema is the problem of information integration. There are three architectures for information integration systems: federated databases, data warehouses and mediators [1]. Today the mediator-based [2] information integration system is the most widely used; it has two main parts, the mediator and the wrapper. The wrapper is oriented to the data source. Its tasks include processing queries from the mediator and passing them to the data source, extracting data from the data source for the mediator's cache, and detecting schema changes in the data source so that the mediator can maintain the rules for converting data from the global schema to the local schema of the data source.

Our information integration system is a mediator system with XML as the data exchange and representation format. It differs from other information integration systems in that we use an XML warehouse as a cache holding the most frequently queried information. Our system integrates information from relational DBMSs, object-relational DBMSs, object-oriented DBMSs and unstructured data sources, especially web data sources. The schemas of the former three kinds of data source can be represented as semi-structured data in a tree structure. The node set of the local schema is L, and the node set of the global schema is G. The mapping between the global schema and the local schema is a mapping function f. A query on the global schema is first converted into a query on the data source's local schema by the stored mapping function. The wrapper submits the query to the data source, wraps the data returned by the data source and sends it to the mediator. Finally, the mediator integrates the information returned from the different data sources into a single result set.

Query processing for unstructured web information is quite different. It requires a component that converts a query on the global schema into keywords, or another form accepted by the forms of the web site, and that extracts semi-structured information from unstructured web pages. This component is called the query proxy. The wrapper for a web site submits a legal query to the form on the web site and returns the resulting links or web pages to the query proxy.

There are many works on wrappers in various information integration systems, and many algorithms for schema conversion, view selection and query rewriting in wrappers have been presented in earlier work. Due to space limitations, a complete version of this paper with a survey of related work is in [9].

The architecture of our system is shown in Fig. 1. In this paper, we focus on wrappers for structured and semi-structured data sources, while the wrapper for unstructured data sources is left as a problem of metasearch and information integration.

[Fig. 1. The Architecture of Information Integration. Components shown: mediator, XML warehouse, proxy, crawler, wrappers, OODB, RDB, search engines, Web.]

Our contributions in this paper are:
a) A complete architecture of a wrapper adapted to a mediator with an XML warehouse as cache is presented.
b) The behavior of the wrapper is described in detail.
c) The security strategy is introduced.

This paper is organized as follows. Section 2 gives an overview of the wrapper architecture, introducing the function of each component and the processing flow. Section 3 describes the basic data structures and the behavior of each module. The security strategy is introduced in Section 4. Section 5 concludes the paper and discusses future work; related work is surveyed in [9].

[Fig. 2 diagram labels: communication interface and traffic cop (query in, encoded answer out); query translator, view manager, SQL generator, answer cleaner and XML wrapper; materialized view chooser, query merger, query generator and data wrapper; schema change detector; backend DBMS.]

2. Overview of Wrapper Architecture


The wrapper has three subsystems: the data extract subsystem, the query executor subsystem and the schema change detector. The view is an important representation of information in the wrapper. Views are built on the part of the data source's local schema that is related to the global schema. Since a query from the mediator addresses the whole local schema of the data source, one goal of building views is to rewrite a query on the whole local schema into the optimal query for the data source. Another goal is to choose the views to materialize and store in the XML warehouse. A view maintained in the wrapper is called a local view.

The outer shell of the wrapper is the communication interface and traffic cop, which sends and receives data exchange messages. This part parses each command from the mediator and determines which subsystem is to process it.

The schema change detector monitors and records changes to the metadata related to the schema of the data source. When the schema of the data source changes and the changed part is related to the global schema, the schema change detector reports the change to the mediator.

The query executor subsystem serves queries from the mediator directly. The query generated by the mediator is translated into a query on the local views by the query translator and optimized using the view information held by the view manager. The SQL generator then produces the SQL statement, or a statement in whatever query language the data source uses, and sends it to the data source. The data returned by the data source arrives in separate parts; the answer cleaner composes them into a whole and eliminates duplicate data. The data to be sent to the mediator is wrapped as XML and compressed when necessary.
The data extract subsystem extracts the data of the views to be materialized for the XML warehouse. Which views are materialized is decided partly by the mediator and partly by the wrapper, based on the changes to the related data. The views to be materialized are merged and optimized in the query merger, and the queries to the data source are generated in the query generator. The returned result is wrapped in the data wrapper. The wrapper is lightweight: it holds only the buffer in the traffic cop and the pointers to the views in the data source, while nearly all view information is stored in the data source.

Fig. 2. The Wrapper Architecture (query executor subsystem and data extract subsystem)
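The routing role of the communication interface and traffic cop can be illustrated with a minimal Python sketch. The command names (`QUERY`, `EXTRACT`, `SCHEMA`) and the three handler functions are assumptions made for this example only; the paper does not specify the command format exchanged with the mediator.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the three subsystems of the wrapper.
def run_query(payload: str) -> str:
    return f"query executor <- {payload}"

def extract_views(payload: str) -> str:
    return f"data extract subsystem <- {payload}"

def report_schema(payload: str) -> str:
    return f"schema change detector <- {payload}"

# Assumed command names; not taken from the paper.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "QUERY": run_query,
    "EXTRACT": extract_views,
    "SCHEMA": report_schema,
}

def dispatch(command: str) -> str:
    """Parse 'NAME payload' and route it to the matching subsystem."""
    name, _, payload = command.strip().partition(" ")
    handler = HANDLERS.get(name.upper())
    if handler is None:
        raise ValueError(f"unknown command: {name!r}")
    return handler(payload)
```

A table-driven dispatcher keeps the shell small, which fits the requirement that the wrapper itself stay lightweight.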

3. The Details of the Modules


In this section we describe the structure and behavior of the modules in our wrapper in detail. We suppose that the conversion from global view to local view in the mediator, and the construction of the relation between the global schema and the local one, are transparent to the wrapper. That is, the uncertain and approximate operations in the mediator have no effect on the wrapper. A wrapper is built for a particular data source, so the implementation details differ between data sources, but the architecture and the basic algorithms are the same. A wrapper can be built completely manually, semi-automatically by filling in optional blanks in the code, or automatically by analyzing the metadata and behavior of the data source. Since an automatic wrapper generator is not economical for only a few data sources, we implement the wrappers manually for the different types of data source, with the same architecture and basic algorithms. In this section, we use a single running example, shown in Fig. 3. Fig. 3(a) is the global schema. Data source A is a relational database with the schema in Fig. 3(b). Data source B is an XML database with the schema in Fig. 3(c).

3.1 Query Rewrite

In this step, the query processing plan is generated. To discuss problems relating to queries and views, a query is defined in SFW (select-from-where) form, in which each object is identified by its path in the local schema. Since a query from the mediator addresses the whole local schema, the first task is to decompose it into separate parts that share no field relations. There are two ways to combine the results of the separated queries: the union of the different data items and the Cartesian product. The semantics of the Cartesian product are hard to define, and it produces very large results, so unless specifically requested, the union is preferred. If the result schemas of the separated queries are the same, the query statements can be connected with a UNION-like operation. Otherwise, the top of the plan is an APPEND operation, which wraps and sends the result sets of the queries one by one. The queries generated by this step are sent to the data source. In our example, for data source A, if there are no other views over the database, the mediator's query to list all kinds of books is separated into queries on the tables for the different kinds of books. The separated queries are sent to the data source individually. This process is shown in Fig. 4.
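The choice between UNION and APPEND at the top of the plan can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SubQuery` type, the `plan` function and the table names other than `math` are hypothetical, and two result schemas are considered the same when their field tuples are equal.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class SubQuery:
    """One separated query: the fields to select and the table to read."""
    fields: Tuple[str, ...]
    table: str

    def sql(self) -> str:
        return f"SELECT {', '.join(self.fields)} FROM {self.table}"

def plan(subqueries: List[SubQuery]) -> Tuple[str, List[str]]:
    """Return the top-level operation and the statements to send."""
    statements = [q.sql() for q in subqueries]
    schemas = {q.fields for q in subqueries}
    if len(schemas) == 1:
        # Same result schema: connect the statements into one UNION query.
        return "UNION", [" UNION ".join(statements)]
    # Different schemas: send the queries and wrap their results one by one.
    return "APPEND", statements
```

For example, two book tables with the fields `name, price` yield a single UNION statement, while adding a table that exposes `name, author` makes the plan fall back to APPEND with one statement per table.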

3.2 View Management

A view is a predefined, or even pre-executed, query that accelerates query execution. The view definitions of our wrapper are stored in the data source; the wrapper keeps only the pointers needed to manage them. In the query executor subsystem, a query can be optimized with the views. If the query is contained in a view, the query is trimmed with the view, especially when the query contains aggregation. The strategy for choosing materialized views is based on the query statistics, the data updates and the pre-decision of the mediator. The materialized view chosen by the mediator is Vm, the schema updated most is Vu, and the most queried view is Vq. The materialized view is Vm (Vu Vq). Vq is computed from statistics as follows. The number of visits to every node is recorded, as is the number of times each restriction on the node is used in queries. The rule for restriction statistics is that if restriction ra contains rb, the usage counts of both ra and rb are increased. The nodes whose visit count is above the threshold, together with their parents in the local schema, form the schema of Vq. The conjunction of the disjunction of the restrictions whose usage count is above a threshold is the condition of Vq.

3.3 Query Merge

If two or more views chosen for materialization can be answered by one query, they are merged before the data source is queried, in order to decrease the storage space. There are three cases in which merging queries is economical:
a) One query qo contains a query set Q, on restriction or on field. In this case, qo is executed and the queries in Q merely filter the result of qo.
b) The queries in Q are on one schema and differ only in their conditions. In this case, the union of the restrictions of the queries in Q is the restriction r of the merged query q, and the union of their field sets is the field set of q. The result of any qi in Q, with restriction ri and field set fi, is obtained by filtering the result of q with ri and projecting it on fi.
c) The queries in Q are joins of the same schemas with different join conditions. This case is similar to case b): the union of Q is the query sent to the data source, and the result is filtered by the different join conditions.
An example of query merging for data source A is shown in Fig. 5.

Q1: select name, price, author from math where price<100
Q2: select name, price from math where price<150
Q3: select name, price, author from math where price<100 and author=Shining
Merged query sent to the data source: select name, price, author from math where price<150
Result for Q1: filter the merged result with price<100
Result for Q2: project the merged result on name, price
Result for Q3: filter the merged result with price<100 and author=Shining

Fig. 5. Example for Query Merging
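Merge case b) and the example of Fig. 5 can be sketched in Python as follows. The sketch is illustrative only: restrictions are modeled as predicates over rows, the rows of table `math` are invented sample data, and it assumes that every field used in a restriction also appears in the merged field set, which holds for Q1 to Q3.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Query = Tuple[List[str], Callable[[Row], bool]]  # (field set f, restriction r)

def merge(queries: List[Query]) -> Query:
    """Case b): union of the field sets and union (OR) of the restrictions."""
    fields: List[str] = []
    for query_fields, _ in queries:
        for name in query_fields:
            if name not in fields:
                fields.append(name)
    restrictions = [r for _, r in queries]
    return fields, lambda row: any(r(row) for r in restrictions)

def run(query: Query, rows: List[Row]) -> List[Row]:
    """Filter rows with the restriction, then project on the field set."""
    fields, restriction = query
    return [{f: row[f] for f in fields} for row in rows if restriction(row)]

# Hypothetical contents of table `math` in data source A.
MATH: List[Row] = [
    {"name": "Algebra", "price": 80, "author": "Shining"},
    {"name": "Calculus", "price": 90, "author": "Lee"},
    {"name": "Topology", "price": 120, "author": "Shining"},
    {"name": "Analysis", "price": 200, "author": "Wu"},
]

Q1: Query = (["name", "price", "author"], lambda r: r["price"] < 100)
Q2: Query = (["name", "price"], lambda r: r["price"] < 150)
Q3: Query = (["name", "price", "author"],
             lambda r: r["price"] < 100 and r["author"] == "Shining")

merged = merge([Q1, Q2, Q3])
merged_rows = run(merged, MATH)  # the only query sent to the data source
answers = [run(q, merged_rows) for q in (Q1, Q2, Q3)]  # answered by filtering
```

Each entry of `answers` equals what the corresponding query would return if it were run against the table directly, so only the merged result needs to be fetched and stored.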

3.4 Result Wrap


The results of queries are returned to this part to be wrapped as XML. The result is cleaned first: duplicate data is eliminated in this step. Since the schema of the result is known, the result is wrapped to fit the schema. To reduce the amount of data communicated, each XML tag is the ID of the tag rather than the tag string. If a compress command is given, the result is compressed to speed up transmission. To compress data items with the same schema, the method in [3] is effective, but it can be improved by computing the edit distance only between each data item and the next one. In this case, the first data item and the differences between every two adjacent data items are sent to the mediator as the result. An example of returning a result with just three XML data items from data source B is given in Fig. 6. The details of compression for communication are in [5]. Encryption of the result may be required to secure the data in transit. The cryptographic key is agreed upon by the mediator and the wrapper. The format of the result returned to the mediator depends on the protocol between mediator and wrapper, such as INEXP [4].
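The cleaning and wrapping steps can be sketched with Python's standard `xml.etree.ElementTree` module. The tag dictionary `TAG_IDS`, the `t` prefix on tag IDs (used here so that the tags remain well-formed XML names) and the sample rows are assumptions made for this illustration; compression and encryption are left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical tag dictionary shared by wrapper and mediator.
TAG_IDS: Dict[str, str] = {"result": "t0", "book": "t1", "name": "t2", "price": "t3"}

def clean(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop duplicate rows while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def wrap(rows: List[Dict[str, str]], item_tag: str = "book") -> str:
    """Wrap cleaned rows as XML, writing tag IDs instead of tag strings."""
    root = ET.Element(TAG_IDS["result"])
    for row in clean(rows):
        item = ET.SubElement(root, TAG_IDS[item_tag])
        for field, value in row.items():
            ET.SubElement(item, TAG_IDS[field]).text = value
    return ET.tostring(root, encoding="unicode")

rows = [
    {"name": "Algebra", "price": "80"},
    {"name": "Algebra", "price": "80"},  # duplicate, removed by clean()
    {"name": "Calculus", "price": "90"},
]
wrapped = wrap(rows)
```

The mediator would need the same dictionary to map the IDs back to tag strings.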

3.5 Schema Change Detect

The schema change detector watches the input of the data source. It parses the input and monitors commands that could change the schema of the data source. If a command to the data source may change the schema, the operation is analyzed in detail, and any change to the part of the local schema related to the global schema is identified and recorded as a script. When the mediator needs to update the schema, the wrapper executes the script and generates the new local schema. Note that the views are managed by the mediator, so when a schema change affects a view, the mediator should be notified promptly. The creation of a new schema is reported to the mediator at once so that the mediator can determine whether the new schema is useful.

4. The Security Strategy of Wrapper

In a distributed environment, the wrapper may be attacked by a malicious data source, especially when the wrapper is executed inside the data source. The code of the wrapper or the cryptographic key may be extracted. This is dangerous because the wrapper may be controlled to return false information, or the intention of the user may not be protected. There are two ways to deploy the wrapper: on a dedicated computer or inside the data source. The security concerns differ in the two cases. If the wrapper runs on a dedicated computer, securing the wrapper means securing that computer and the related network. If the wrapper runs inside the data source, the security strategies of mobile agents [6][7] can be used to protect the wrapper from attacks by its run-time environment. The strategies are as follows:

Assure the run-time environment: before the wrapper is deployed to an unknown environment, the security of the environment should be assured.
Record the log: each interaction between the wrapper and the environment is recorded. The record is sent to the mediator promptly when the wrapper is attacked. When the wrapper is attacked by the same means again, it can take the corresponding countermeasure.
Encrypt the important data and algorithm code in the wrapper: the important data and the code of the algorithms can be encrypted to protect the key parts of the wrapper. The CEF method [8] is used for encryption. At present there is no general CEF method, but the method based on polynomials and rational functions is feasible. It could be implemented by storing a cryptographic key computed by a secret rational function.

5. Conclusion and Future Work

In this paper, we present the architecture of the wrapper in a mediator-based information integration system. Our wrapper architecture has three parts for different uses. The architecture supports both direct queries from the mediator and data extraction for the XML warehouse that serves as the mediator's cache. The information processing flow, the details of the wrapper's behavior and the protocol between mediator and wrapper are introduced. The architecture adapts to different kinds of data source with only minor modification. How to generate the wrapper for a data source automatically is an important problem when many data sources are to be queried. Our wrapper is separate from the data source, and the query plan of the data source has no effect on the query processing of the wrapper. If the query plan of the data source were taken into account, the query processing of the wrapper could be optimized. We will study these problems in future work.

References
[1] Hector Garcia-Molina et al. Database System Implementation. Prentice Hall, 2000.
[2] G. Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer, 25:38-49.
[3] Hongzhi Wang, Jianzhong Li, Zhenying He. A Storage Strategy for Compress XML Warehouse. NDBC2002.
[4] Hongzhi Wang, Jianzhong Li, Zhenying He. INEXP: Information Exchange Protocol for Interoperability. ICADL2002.
[5] Hongzhi Wang, Jianzhong Li, Zhenying He. Compress Communication and Query Process in the Distributed Information Integration System Based on XML. DPCS2002.
[6] Christian F. Tschudin. Mobile agent security. Intelligent Information Agents. Springer, Germany, 1999.
[7] Joan Feigenbaum, Peter Lee. Trust management and proof-carrying code in secure mobile code applications. DARPA Workshop on Foundations for Secure Mobile Code, March 1997.
[8] Tomas Sander and Christian F. Tschudin. Protecting mobile agents against malicious hosts. Mobile Agents and Security, Lecture Notes in Computer Science 1419. Springer, Berlin, 1998.
[9] Hongzhi Wang, Jianzhong Li, Zhenying He. An Effective Wrapper Architecture to Heterogeneous Data Source. Technical Report of Harbin Institute of Technology.

