Institute for Informatics
Georg-August-Universität Göttingen

Databases and Information Systems


Web Data Integration and Data Management
Summer 2014

Prof. Dr. Wolfgang May
Daniel Schubert, MSc

Technical Data

  • Advanced Bachelor or Master/Diploma in Applied Computer Science or Information Systems (Wirtschaftsinformatik)
  • Prerequisites: basic knowledge of e.g. XML and/or RDF
  • 6 ECTS
  • Number of participants: max. 16-20 (about 8-10 talks by 1 or 2 persons each)
  • Language: German and English are allowed. Reading English texts/documentation is required.

Time Schedule

  • First meeting at the beginning of the semester:
    Monday 28.4., 14h c.t., SR 2.101, IFI.
    Assignment of topics and papers.
  • May/June: preparation of case studies and presentations, individual meetings
  • Registration/Deregistration in FlexNever is open until 30.6. 14:00.
  • July: presentations.
    We cannot use the Monday slot for presentations because it collides with other seminars. Some free slots of the Semantic Web lecture (Wed 10-12, Thu 10-12, Fri 12-14) at the end of the lecture period will probably be used instead.
  • Only two topics have been worked out:
    17.7. 10-12 and 14-16: IBM and ontop


There is a lot of data available in the Web and in the Semantic Web. Web data is usually provided in human-readable form as Web pages (including forms, the so-called Deep Web), so users cannot process it in a database-style way. Data extraction, e.g. from the CIA World Factbook or from Wikipedia, is thus a never-ending "hot topic". Besides pattern-based approaches, Natural Language Processing approaches are also used.

The Semantic Web (cf. lecture Semantic Web) aims to provide, extend and/or annotate Web data in a machine-readable way. For this, the RDF data format is used, together with the OWL ontology language for describing metadata.
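
To illustrate what "machine-readable" means here, the following minimal sketch models RDF statements as (subject, predicate, object) triples in plain Python. The `ex:` names and the `match` helper are hypothetical and only serve as an illustration; they are not part of any RDF library or of the seminar material. A real application would use an RDF store and SPARQL.

```python
# RDF data is a set of (subject, predicate, object) triples.
# All "ex:" names below are made-up examples for illustration.
triples = {
    ("ex:Germany", "rdf:type", "ex:Country"),
    ("ex:Germany", "ex:capital", "ex:Berlin"),
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:France", "rdf:type", "ex:Country"),
    ("ex:France", "ex:capital", "ex:Paris"),
}


def match(pattern, data):
    """Return all triples matching the pattern, sorted; None is a wildcard."""
    return sorted(
        t for t in data
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


# "Which resources are countries?" expressed as a single triple pattern.
countries = [s for s, _, _ in match((None, "rdf:type", "ex:Country"), triples)]
print(countries)
```

Because every statement has the same three-part shape, a program can answer such questions by pattern matching alone, which is not possible for the same facts embedded in the free text of a Web page.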

Potential Topics

  • IBM RDF-to-relational Storage:
    Comment: a (very sophisticated) approach to designing a relational schema for storing RDF data in a relational DB.
    Paper: Mihaela A. Bornea, Julian Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, Octavian Udrea, Bishwaranjan Bhattacharjee: Building an efficient RDF store over a relational database. SIGMOD Conference 2013: 121-132
    [assigned to S.S.]
  • The ontop Project (FU Bolzano)
    Comment: An approach for OBDA (ontology-based data access), i.e., querying existing (relational) DBs using ontology information. This is the reverse direction of our RDF2Rel: the relational DB already exists, and the user states SPARQL queries against a (manually developed) ontology. Ontop then provides the functionality for mapping SPARQL to the underlying DB.
    Paper: Mariano Rodriguez-Muro, Roman Kontchakov, Michael Zakharyaschev: Ontology-Based Data Access: Ontop of Databases. International Semantic Web Conference (1) 2013: 558-573
    [assigned to L.R.]
  • RDF database generated from Wikipedia: DBpedia
    Comment: describe the project, evaluate its public interface and data quality, describe how its data is obtained and stored, etc.
    [assigned to A.T.]
  • RDF database/ontology from Wikipedia and GeoNames: YAGO/YAGO2
    Comment: describe the project, evaluate its public interface and data quality, describe how its data is obtained and stored, etc.
    Starting Paper: Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum: YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. WWW 2007: 697-706
    [assigned to G.J.]
  • Web Scraping via Browser Automation with Selenium
    Comment: Evaluate, do a case study, and describe. Neither XML nor RDF knowledge required, but practical competence obviously needed.
    [assigned to O.S.]
  • A theoretical paper that presents a true algebra over RDF data.
    Paper: Leonid Libkin, Juan L. Reutter, Domagoj Vrgoc: TriAL for RDF: adapting graph query languages for RDF data. PODS 2013: 201-212
  • Web Scraping with OXPath/Diadem
    Comment: Evaluate, do a case study, and describe. Based on XPath/XML. No RDF knowledge required, but XML/XPath knowledge required.
    [assigned to A.Z.]
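
As background for the two database-related topics above, the following sketch shows the simplest baseline for storing RDF in a relational DB: a single three-column triple table, where a SPARQL-like basic graph pattern is answered by a SQL self-join. This is an illustration under stated assumptions, not the schema of the IBM paper (which is considerably more sophisticated) and not ontop's mapping mechanism; the table name `triples`, the `ex:` names, and the sample data are hypothetical.

```python
import sqlite3

# In-memory database with one generic triple table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:Germany", "rdf:type", "ex:Country"),
        ("ex:Germany", "ex:capital", "ex:Berlin"),
        ("ex:France", "rdf:type", "ex:Country"),
        ("ex:France", "ex:capital", "ex:Paris"),
        ("ex:Berlin", "rdf:type", "ex:City"),
    ],
)

# SPARQL-like pattern:  ?c rdf:type ex:Country .  ?c ex:capital ?cap .
# Each triple pattern becomes one alias of the table; the shared
# variable ?c becomes the join condition t1.s = t2.s.
rows = conn.execute(
    """
    SELECT t1.s, t2.o
    FROM triples AS t1 JOIN triples AS t2 ON t1.s = t2.s
    WHERE t1.p = 'rdf:type' AND t1.o = 'ex:Country'
      AND t2.p = 'ex:capital'
    ORDER BY t1.s
    """
).fetchall()
print(rows)
```

In this baseline, every additional triple pattern in a query adds another self-join on the same table, which is one reason why more elaborate relational schemas for RDF are studied.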

Note: Papers can be found via DBLP (originally, DBLP meant "Databases and Logic Programming", but by now it covers all topics in Computer Science), or simply by searching for the paper title with Google (this often yields the PDF directly). A list of other papers by the same authors can then be found via DBLP.

Form of the Seminar

The intention of the seminar is to get an overview of the state of the art in data integration from the Web and background data management.

For each topic, the following has to be done:

  • write a tutorial-style paper that gives an overview of the approach,
  • evaluate some tools and write a report (installation, functionality, usability, ...) [optionally in German or English],
  • prepare an illustrative medium-size case study using one or more tools (optionally: comparative),
  • give a presentation that covers the tutorial and includes a demo of how to use the tool(s) (about 90 minutes incl. discussion; optionally in German or English).