Hongzhi Wang, Jianzhong Li, Zhenying He
Department of Computer Science and Technology, Harbin Institute of Technology
whongzhi@0451.com, lijz@banner.hl.cninfo.net, hzy_hit_cn@sina.com

Abstract
In this paper, we focus on the problem of obtaining data from heterogeneous data sources accurately and efficiently in an information integration system. XML is used as the data exchange format of the wrapper. We design the wrapper architecture around the conversion and management of views, which serve as the bridge from the global schema to the local schemas of the various data sources. Our wrapper has two main subsystems: the data extraction subsystem and the query executor subsystem. The former loads data into the cache of the mediator when changes exceeding a threshold are detected, and the latter answers queries from the mediator. The architecture adapts to data and schema changes of the data source and can answer the queries of the mediator efficiently. Since the wrapper may run in an environment outside our control, the processing in the wrapper should be simple enough. The storage in the wrapper itself should be as small as possible, and the storage of the data source can be used instead. The details of the modules for query rewrite, view management, query merge, result wrap and schema change detection are discussed. The behavior of the wrapper during query processing is discussed with a running example. The security strategy, especially for the case in which the wrapper runs inside an autonomous data source, is also introduced in this paper.

information integration systems is that we use an XML warehouse as a cache holding the most frequently queried information. Our system integrates information from relational DBMSs, object-relational DBMSs, object-oriented DBMSs and unstructured data sources, especially web data sources. The schemas of the former three kinds of data source can be represented as semi-structured data in a tree structure. The node set of the local schema is L, and the node set of the global schema is G. The mapping between the global schema and the local schema is given by a mapping function f. A query on the global schema is first converted into a query on the data source's local schema by the stored mapping function.
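The conversion of a global-schema query through the stored mapping function can be illustrated with a minimal sketch. It assumes that f is stored as a table from global node paths to local node paths per data source. The table `GLOBAL_TO_LOCAL`, the function `to_local_paths` and all paths and source names are hypothetical and are not taken from our system.

```python
# Hypothetical stored mapping f: global-schema node path -> local-schema
# node paths, one entry per data source. All paths are invented examples.
GLOBAL_TO_LOCAL = {
    "/library/book/title": [("A", "novel/name"), ("B", "/books/book/title")],
    "/library/book/price": [("A", "novel/price"), ("B", "/books/book/price")],
}


def f(global_path, source):
    """Return the local nodes of `source` that one global node maps to."""
    return [local for src, local in GLOBAL_TO_LOCAL.get(global_path, [])
            if src == source]


def to_local_paths(global_paths, source):
    """Translate the node paths used by a global query for one data source."""
    local_paths = []
    for path in global_paths:
        local_paths.extend(f(path, source))
    return local_paths


# Paths of a global query asking for titles and prices, translated for source B.
local = to_local_paths(["/library/book/title", "/library/book/price"], "B")
```

A global node with no entry for a source simply contributes no local path, which corresponds to a source that cannot answer that part of the query.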
The wrapper submits the query to the data source, wraps the data returned from the data source and sends it to the mediator. Finally, the mediator integrates the information returned from the different data sources into a single result set.

The query process for unstructured web information is quite different. A component is needed that converts a query on the global schema into keywords or another form suitable as input to the form of a web site, as is a component that extracts semi-structured information from unstructured web pages. This component is called the query proxy. The wrapper for a web site submits a legal query to the form on the web site and returns the resulting links or web pages to the query proxy.

There are many works on wrappers in various information integration systems, and many algorithms for schema conversion, view selection and query rewrite in wrappers have been presented in earlier work. Due to space limitations, a complete version of this paper with a survey of related work is given in [9].

The architecture of our system is shown in Fig.1. In this paper, we focus on wrappers for structured and semi-structured data sources, while the wrapper for unstructured data sources is left as a problem of metasearch and information integration.

Fig.1 The Architecture of Information Integration (components shown: mediator, XML warehouse, proxy, crawler, wrappers, OODB, RDB, search engines, WEB)

Our contributions in this paper are:
1. Introduction
Modern information systems need to query not only local, homogeneous data sources, but also heterogeneous data sources in distributed environments, and even unknown data sources hidden in the web. Obtaining information from heterogeneous data sources to answer a query on the global schema is the problem of information integration. There are three architectures for information integration systems: federated database, data warehouse and mediator [1]. At present, the mediator-based [2] information integration system is the most widely used. It has two main parts, the mediator and the wrapper. The wrapper faces the data source. Its tasks include processing the query from the mediator and transmitting it to the data source, extracting data from the data source for the cache of the mediator, and detecting schema changes of the data source so that the mediator can maintain the data conversion rules from the global schema to the local schema of the data source. Our information integration system is a mediator system with XML as the data exchange and representation format. The difference between our system and other
Proceedings of the 17th International Conference on Advanced Information Networking and Applications (AINA03) 0-7695-1906-7/03 $17.00 2003 IEEE
A complete architecture of a wrapper adapted to a mediator with an XML warehouse as cache is presented. The behavior of the wrapper is described in detail. The security strategy is introduced.

This paper is organized as follows. Section 2 gives an overview of the wrapper architecture and introduces the function of each component and the process. Section 3 describes the basic data structures and the behavior of each module. The security strategy is introduced in Section 4. Related work is presented in Section 5.
[Figure: the wrapper architecture. Components shown: query translator, view manager, SQL generator, query merger, query generator, data wrapper, schema change detector and the backend DBMS.]
merger, and the query to the data source is generated in the query generator. The returned result is wrapped in the data wrapper. The wrapper is lightweight: it holds only the buffer for data in transit and the pointers from the views to the data source, while almost all the information of the views is stored in the data source.
with the schema in fig.b. Data source B is an XML database whose schema is shown in fig.c.
3.1 Query Rewrite
In this step, the plan of the query process is generated. To discuss problems relating to queries and views, a query is defined in SFW (select-from-where) form, in which each object has its path in the local schema. Since the query from the mediator relates to the whole local schema, the first task is to decompose the query into separate parts that have no field relations between them. There are two ways to combine the results of the separated queries: the union of the different data items, and the Cartesian product. However, the semantics of the Cartesian product are hard to define, and it produces a very large result. Unless specially requested, the union of the results is preferred. If the result schemas of the separated queries are the same, the query statements can be connected with a UNION-like operation. Otherwise, the top of the plan is an APPEND operation, which means wrapping and sending the result sets of the queries one by one. The query generated by this step is sent to the data source.

In our example, for data source A, if there are no other views over the database, the query from the mediator to list all kinds of books is separated into queries on the tables of the different kinds of books. The separated queries are sent to the data source individually. This process is shown in fig.4.
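The choice between UNION and APPEND described above can be sketched as follows. This is a minimal illustration rather than the implementation in our wrapper: it assumes that each separated query is given as SQL text together with the list of its result columns, and the function name `build_plan`, the plan tuples and the table names are hypothetical.

```python
def build_plan(subqueries):
    """Combine separated sub-queries into a simple plan.

    `subqueries` is a list of (sql_text, result_columns) pairs.
    Sub-queries with identical result columns are joined with UNION.
    If more than one group remains, the root of the plan is APPEND,
    meaning the result sets are wrapped and sent one by one.
    """
    groups = {}  # result schema -> SQL texts, in first-seen order
    for sql, columns in subqueries:
        groups.setdefault(tuple(columns), []).append(sql)
    branches = [" UNION ".join(sqls) for sqls in groups.values()]
    if len(branches) == 1:
        return ("QUERY", branches)
    return ("APPEND", branches)


# Hypothetical tables of different kinds of books in a relational source.
plan = build_plan([
    ("SELECT title, price FROM novel", ["title", "price"]),
    ("SELECT title, price FROM textbook", ["title", "price"]),
    ("SELECT title, issue FROM magazine", ["title", "issue"]),
])
```

Here the two queries with the same result schema are merged into one UNION statement, and the magazine query, whose schema differs, becomes a second branch under APPEND.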
statistics of restrictions is that if restriction ra contains rb, the usage counts of both ra and rb are increased. The nodes whose visit counts are above the threshold, together with their parents in the local schema, form the schema of Vq. The condition of Vq is the conjunction of the disjunctions of the restrictions whose usage counts are above a threshold.
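The counting rule and the construction of the condition of Vq can be illustrated with a small sketch under several assumptions that go beyond the text. A restriction is modelled as a closed interval on one field, the containment rule is applied in both directions whenever a restriction is used by a query, and restrictions on the same field are combined by disjunction while different fields are combined by conjunction. The class `RestrictionStats` and its methods are hypothetical names.

```python
from collections import Counter, defaultdict


def contains(ra, rb):
    """True if interval restriction ra = (field, low, high) contains rb."""
    return ra[0] == rb[0] and ra[1] <= rb[1] and rb[2] <= ra[2]


class RestrictionStats:
    def __init__(self):
        self.usage = Counter()  # restriction -> usage count

    def record(self, r):
        """Count one use of r and of every known restriction that
        contains r or is contained in r (assumed reading of the rule)."""
        for known in list(self.usage):
            if known != r and (contains(known, r) or contains(r, known)):
                self.usage[known] += 1
        self.usage[r] += 1

    def view_condition(self, threshold):
        """AND over fields of the OR of restrictions used more than
        `threshold` times (assumed grouping by field)."""
        by_field = defaultdict(list)
        for (field, low, high), count in self.usage.items():
            if count > threshold:
                by_field[field].append(
                    f"({low} <= {field} AND {field} <= {high})")
        return " AND ".join(
            "(" + " OR ".join(sorted(parts)) + ")"
            for _, parts in sorted(by_field.items()))


stats = RestrictionStats()
stats.record(("price", 0, 100))
stats.record(("price", 10, 20))   # contained in the first restriction
stats.record(("year", 2000, 2003))
condition = stats.view_condition(threshold=1)
```

With these three recorded uses, only the restriction on price between 0 and 100 exceeds the threshold, because it is counted once for its own use and once for the contained restriction.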
example of returning a result with just three XML data items from data source B is given in fig.6. The details of compression for communication are given in [5]. Encryption of the result may be required to keep the transmitted data secure. The cryptographic key is agreed upon by the mediator and the wrapper. The format of the result returned to the mediator depends on the protocol between the mediator and the wrapper, such as INEXP [4].
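The wrapping step can be sketched as follows. This is only an illustration under stated assumptions: the element names `result` and `item` are invented, `zlib` is used as a stand-in for the compression method of [5], the encryption step is left out, and the layout does not follow INEXP [4].

```python
import xml.etree.ElementTree as ET
import zlib


def wrap_result(rows, item_tag="item"):
    """Wrap rows (a list of dicts) as one XML document, returned as bytes."""
    root = ET.Element("result")
    for row in rows:
        item = ET.SubElement(root, item_tag)
        for name, value in row.items():
            ET.SubElement(item, name).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def pack(rows):
    """Wrap and compress a result set before sending it to the mediator."""
    return zlib.compress(wrap_result(rows))


def unpack(payload):
    """Mediator side: decompress and read the data items back."""
    root = ET.fromstring(zlib.decompress(payload))
    return [{child.tag: child.text for child in item} for item in root]


rows = [{"title": "Database Systems", "price": 30},
        {"title": "XML Basics", "price": 12}]
received = unpack(pack(rows))
```

All values arrive as text on the mediator side, so any typing of the fields would have to come from the schema agreed between mediator and wrapper.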
polynomial and rational number is feasible. The method could be implemented by storing a cryptographic key computed by a secret rational function.
References
[1] Hector Garcia-Molina et al. Database System Implementation. Prentice Hall, 2000.
[2] G. Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer, 25:38-49.
[3] Hongzhi Wang, Jianzhong Li, Zhenying He. A Storage Strategy for Compress XML Warehouse. NDBC2002.
[4] Hongzhi Wang, Jianzhong Li, Zhenying He. INEXP: Information Exchange Protocol for Interoperability. ICADL2002.
[5] Hongzhi Wang, Jianzhong Li, Zhenying He. Compress Communication and Query Process in the Distributed Information Integration System Based on XML. DPCS2002.
[6] Christian F. Tschudin. Mobile agent security. Intelligent Information Agents. Springer, Germany, 1999.
[7] Joan Feigenbaum, Peter Lee. Trust management and proof-carrying code in secure mobile code applications. DARPA Workshop on Foundations for Secure Mobile Code, 26 March 1997.
[8] Tomas Sander and Christian F. Tschudin. Protecting mobile agents against malicious hosts. Mobile Agents and Security, Lecture Notes in Computer Science 1419. Springer, Berlin, 1998.
[9] Hongzhi Wang, Jianzhong Li, Zhenying He. An Effective Wrapper Architecture to Heterogeneous Data Source. Technical Report of Harbin Institute of Technology.