Australasian Database Conference (ADC 2000),
Jan. 31 - Feb. 3,
Australian Computer Science Communications, Vol. 2, No. 2, IEEE CS Press,
An Integrated Architecture for Exploring, Wrapping,
Mediating and Restructuring Information from the Web
The goal of information extraction from the Web is to provide
an integrated view on heterogeneous information sources. A main
problem with current wrapper/mediator approaches is that they rely
on very different formalisms and tools for wrappers and mediators,
thus leading to an "impedance mismatch" between the wrapper and
mediator levels. Additionally, most current approaches are tailored
to access information from a fixed set of sources.
In this paper, we discuss an architecture where Web exploration,
wrapping, mediation, and querying are done in an integrated system.
Such an architecture offers significant advantages when combined
with a unified framework (i.e., a data model and a language) in
which all tasks are done. Our approach is based on a unified model
of the application-level information and the relevant fragment of
the Web, and on an integrated language for accessing the Web,
wrapping, mediating, and querying information.
In contrast to other approaches, the relevant part of the Web
becomes a part of the internal world model of the system.
This allows for a data-driven Web exploration which is
independent of a given network of individual predefined wrappers
and mediators. Thus, in addition to the classical wrapping and
mediating functionality, a system in this architecture can be
equipped with Web navigation and exploration functionality.
In an abstract sense, the system comprises a universal wrapper that
can be applied to arbitrary Web data sources that become known to
the system during information processing. Equipped with suitably
intelligent rules, the system can potentially explore previously unknown
parts of the Web, thus coping with the steady growth of the Web.
The architecture is implemented in the FLORID system.
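
The idea of data-driven exploration with a universal wrapper can be illustrated with a minimal sketch. It is written in Python, not in FLORID's own language, and it is not the authors' implementation. The in-memory FAKE_WEB dictionary, the example.org URLs, the "Capital:" page convention, and the names universal_wrapper and explore are all assumptions made up for illustration. The sketch only shows the principle: pages and their links are stored as facts in the same world model as the application data, and a rule over those facts decides which sources to visit next.

```python
import re
from collections import deque

# Hypothetical stand-in for the Web (an assumption): URL -> page text.
FAKE_WEB = {
    "http://example.org/index":
        'See <a href="http://example.org/a">A</a> and '
        '<a href="http://example.org/b">B</a>',
    "http://example.org/a":
        'Capital: Canberra. <a href="http://example.org/b">B</a>',
    "http://example.org/b": "Capital: Wellington.",
}


def universal_wrapper(url, text):
    """Map any page to facts (subject, predicate, object).

    Link facts describe the Web fragment itself; 'capital' facts stand
    for application-level information. Both kinds share one model.
    """
    facts = set()
    for target in re.findall(r'href="([^"]+)"', text):
        facts.add((url, "links_to", target))
    match = re.search(r"Capital: (\w+)", text)
    if match:
        facts.add((url, "capital", match.group(1)))
    return facts


def explore(start_url, fetch):
    """Data-driven exploration: no fixed list of sources is given.

    Exploration rule (illustrative): every URL that appears as the
    object of a 'links_to' fact becomes a source to be wrapped.
    """
    world = set()
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        text = fetch(url)
        if text is None:  # unreachable source: skip it
            continue
        new_facts = universal_wrapper(url, text)
        world |= new_facts
        for _, predicate, obj in new_facts:
            if predicate == "links_to" and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return world


world = explore("http://example.org/index", FAKE_WEB.get)
# Querying works on the same world model that drove the exploration.
capitals = sorted(o for _, p, o in world if p == "capital")
print(capitals)  # ['Canberra', 'Wellington']
```

Only the start URL is supplied; the other two pages become known to the sketch during processing, which mirrors the abstract's point that sources need not be fixed in advance.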