Institute for Informatics
Georg-August-Universität Göttingen

Databases and Information Systems


Australasian Database Conference (ADC 2000),
Canberra, Australia, Jan. 31 - Feb. 3, 2000. Australian Computer Science Communications, Vol. 2, No. 2, IEEE CS Press, pp. 82-89.

An Integrated Architecture for Exploring, Wrapping, Mediating and Restructuring Information from the Web

Wolfgang May


The goal of information extraction from the Web is to provide an integrated view of heterogeneous information sources. A main problem with current wrapper/mediator approaches is that they rely on very different formalisms and tools for wrappers and mediators, which leads to an "impedance mismatch" between the wrapper level and the mediator level. Additionally, most current approaches are tailored to accessing information from a fixed set of sources. In this paper, we discuss an architecture where Web exploration, wrapping, mediation, and querying are done in an integrated system. Such an architecture shows significant advantages when combined with a unified framework - i.e., a data model and a language - in which all tasks are done. Our approach is based on a unified model of the application-level information and the relevant fragment of the Web, and on an integrated language for accessing the Web, wrapping, mediating, and querying information. In contrast to other approaches, the relevant part of the Web thus becomes part of the system's internal world model. This allows for data-driven Web exploration that is independent of a given network of individual predefined wrappers and mediators. Thus, in addition to the classical wrapping and mediating functionality, a system in this architecture can be equipped with Web navigation and exploration functionality. In an abstract sense, the system comprises a universal wrapper that can be applied to arbitrary Web data sources which become known to the system during information processing. Equipped with suitably intelligent rules, the system can potentially explore previously unknown parts of the Web, thus coping with the steady growth of the Web. The architecture is implemented in the FLORID system.
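The idea of data-driven exploration over a unified world model can be illustrated with a minimal sketch. It is written in Python rather than in the language of the FLORID system, and it is not taken from the paper: the in-memory stand-in for the Web (`FAKE_WEB`), the `WorldModel` class, the `relevant` rule, and the `key=value` page format are all hypothetical assumptions chosen for illustration. The sketch keeps Web documents and application-level facts in one store and repeats fetching and wrapping until no new URL becomes known.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory stand-in for the Web: URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.org/countries": (
        "index of countries",
        ["http://example.org/de", "http://example.org/fr"],
    ),
    "http://example.org/de": ("name=Germany; capital=Berlin", []),
    "http://example.org/fr": ("name=France; capital=Paris", ["http://example.org/ads"]),
    "http://example.org/ads": ("buy now", []),
}


@dataclass
class WorldModel:
    """One store for Web-level objects (documents) and application-level facts."""
    known_urls: set = field(default_factory=set)
    documents: dict = field(default_factory=dict)  # url -> page text
    facts: set = field(default_factory=set)        # (subject, property, value)


def relevant(url):
    # Hypothetical exploration rule: never follow advertising pages.
    return not url.endswith("/ads")


def explore(model, fetch):
    """Fetch and wrap pages until no new URL is learned (a fixpoint)."""
    pending = list(model.known_urls)
    while pending:
        url = pending.pop()
        if url in model.documents or not relevant(url):
            continue
        text, links = fetch(url)
        model.documents[url] = text
        # Generic wrapping rule (assumed page format): "key=value; key=value".
        fields = dict(part.split("=", 1) for part in text.split("; ") if "=" in part)
        if "name" in fields:
            for key, value in fields.items():
                if key != "name":
                    model.facts.add((fields["name"], key, value))
        # Links found in the data become known to the system and drive exploration.
        for link in links:
            if link not in model.known_urls:
                model.known_urls.add(link)
                pending.append(link)
    return model


model = explore(
    WorldModel(known_urls={"http://example.org/countries"}),
    FAKE_WEB.__getitem__,
)
```

Only the start URL is given in advance; the two country pages are discovered through links found while processing, which is the sense in which the exploration is data-driven rather than bound to a fixed set of predefined wrappers.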