Dagstuhl-Seminar "Declarative Data Access on the Web",
Schloss Dagstuhl, Germany, September 12-17, 1999. Dagstuhl-Report No. 251.

Information Extraction from the Web with FLORID

Wolfgang May


The talk presents an integrated architecture in which Web exploration, wrapping, mediation, and querying are done in a single monolithic system. The system is based on a unified framework (i.e., a data model and a language) in which all tasks are carried out. We regard the Web and its contents as a unit, represented in an object-oriented data model: the Web structure given by its hyperlinks, the parse trees of Web pages, and its contents all become part of the internal world model of the system. The advantage of this unified view is that the same data manipulation and querying language can be used for both the Web structure and the application-semantic model. The model is complemented by a rule-based object-oriented language that is extended with Web access capabilities and structured document analysis, and that allows for accessing the Web, wrapping, mediating, and querying information. Due to this integration, a system built on this architecture can be equipped with Web navigation and exploration functionality. We present generic rule patterns for typical extraction, integration, and restructuring tasks within this framework. We demonstrate the practicability of the approach using the FLORID system and illustrate it with two case studies.
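
The idea of a single world model holding both Web structure and application data can be illustrated with a small sketch. This is not FLORID code and does not reproduce its rule language. It is a Python approximation under stated assumptions: the dictionary PAGES stands in for the Web (no network access), facts are plain (subject, attribute, value) triples, and all names (PAGES, PageFacts, explore, capitals) are hypothetical. Hyperlinks, page contents, and derived application facts live in one store, and navigation is itself a rule over that store.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the Web: URL -> page source (assumption, no network).
PAGES = {
    "http://example.org/index.html":
        '<html><body><a href="http://example.org/de.html">Germany</a></body></html>',
    "http://example.org/de.html":
        '<html><body><h1>Germany</h1><p class="capital">Berlin</p></body></html>',
}


class PageFacts(HTMLParser):
    """Wrapper step: turns one page into (subject, attribute, value) facts."""

    def __init__(self, url):
        super().__init__()
        self.url = url
        self.facts = set()
        self._open = None  # (tag, class) of the element whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # Web structure: a hyperlink becomes a fact about the page.
            self.facts.add((self.url, "link", attrs["href"]))
        self._open = (tag, attrs.get("class"))

    def handle_endtag(self, tag):
        self._open = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            tag, cls = self._open
            # Page contents: text is keyed by class if present, else by tag.
            self.facts.add((self.url, "text:" + (cls or tag), text))


def explore(start):
    """Navigation rule applied to a fixpoint: every known URL is fetched and
    parsed, and every link target becomes a known URL in turn."""
    facts = {(start, "isa", "url")}
    fetched = set()
    while True:
        known = {s for (s, a, v) in facts if a == "isa" and v == "url"}
        todo = known - fetched
        if not todo:
            return facts
        for url in todo:
            fetched.add(url)
            if url not in PAGES:
                continue  # unreachable page: no facts derived
            parser = PageFacts(url)
            parser.feed(PAGES[url])
            parser.close()
            facts |= parser.facts
            facts |= {(v, "isa", "url")
                      for (s, a, v) in parser.facts if a == "link"}


def capitals(facts):
    """Mediation/query step over the same store: country name -> capital."""
    names = {s: v for (s, a, v) in facts if a == "text:h1"}
    return {names[s]: v
            for (s, a, v) in facts if a == "text:capital" and s in names}


facts = explore("http://example.org/index.html")
result = capitals(facts)
```

Because the link facts and the extracted text facts share one store, the query in capitals uses the same access pattern as the navigation rule in explore. That is the point of the unified view, although a real rule-based system would express both declaratively rather than as Python loops.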