Web Data Integration and Data Management
- Advanced Bachelor or Master/Diploma in
Applied Computer Science or Information Systems (Wirtschaftsinformatik)
- Prerequisites: basic knowledge of, e.g., XML and/or RDF
- 6 ECTS
- Number of participants: max. 10
- Language: German and English are allowed. Reading of English text/documentation
A lot of data is available on the Web and in the Semantic Web. Web data is usually
provided in human-readable form as Web pages (including forms, the so-called Deep Web),
but users cannot process it in a database-style way. Data extraction,
e.g. from the CIA World Factbook or from Wikipedia, is thus a never-ending "hot topic".
Apart from pattern-based approaches, also Natural Language Processing Approaches are
The Semantic Web (cf. the lecture Semantic Web) attempts
to provide, extend and/or annotate Web data in a machine-readable way.
For this, the RDF data format is used, together with the OWL ontology language for
Form of the Seminar
The intention of the seminar is to get an overview of the state of the art in data integration
from the Web and background data management.
For each topic, the following has to be done:
- a written tutorial-style paper that gives an overview of an
- evaluate some tools and write a report (installation, functionality,
usability, ...) [optionally in German or English]
- prepare an illustrative medium-size case study using one or more tools
- a presentation giving the tutorial and showing a demo of how to
use it (about 90 minutes incl. discussion; optionally in German or English).
Time Schedule: Kickoff Meeting
- first meeting at the beginning of the semester:
Monday 23.10. 14h c.t. SR 2.101, IFI: First Meeting
Assignment of topics and papers.
- Registration and deregistration in FlexNever: until 28.1.2018
- Query Languages (mostly for RDF):
- Mark Kaminski and
Egor V. Kostylev: Beyond Well-designed SPARQL (ICDT 2016).
The paper is based on the SPARQL theory-practice papers by
Arenas/Gutierrez/Perez that have been discussed in the Semantic Web lecture [Requires SemWeb]
- Paolo Guagliardo and Leonid Libkin: Correctness
of SQL Queries on Databases with Nulls (SIGMOD Record
Sept. 2017). The paper is about SQL; the talk should evaluate the paper and draw conclusions for SPARQL-style
queries (where nulls are common). [Requires SemWeb]
- An approach that presents a true algebra over RDF data.
Leonid Libkin, Juan L. Reutter, Domagoj Vrgoc:
TriAL for RDF: adapting graph query languages for RDF data. PODS 2013: 201-212;
and Martin Przyjaciel-Zablocki, Alexander Schätzle, and Georg Lausen:
TriAL-QL: Distributed Processing of Navigational Queries
(WebDB 2015) [Requires SemWeb]
- Kemele M. Endris, Mikhail Galkin, Ioanna Lytra, Mohamed Nadjib Mami, Maria-Esther Vidal, Sören Auer:
Querying the Linked Data Web by Bridging RDF Molecule Templates. (DEXA 2017) [Requires SemWeb]
- Jing Wang, Nikos Ntarmos, Peter Triantafillou:
GraphCache: A Caching System for Graph Queries (EDBT 2017) [Requires SemWeb]
- Mikhail Galkin, Diego Collarana, Ignacio Traverso-Ribón, Maria-Esther Vidal, Sören Auer:
SJoin: A Semantic Join Operator to Integrate Heterogeneous RDF Graphs (DEXA 2017) [Requires SemWeb]
- Other Topics:
- Web scraping with OXPath/Diadem (Oxford Univ.)
Comment: Evaluate, do a case study, and describe. Based on XPath/XML. No RDF knowledge required, but XML/XPath knowledge is.
- Wen Hua et al.:
Short Text Understanding Through Lexical-Semantic Analysis (ICDE 2015),
- Data Exploration: the papers Data
Exploration with SQL Using Machine Learning Techniques (EDBT 2017)
and Interactive Data Exploration with Smart Drill-Down
- Deep Learning: DeepLearning4j is an open-source,
distributed deep learning library. It is a Java-based toolkit for building, training and deploying neural networks.
- Recommender Systems: the papers
"Told you i didn't like it": Exploiting uninteresting items for effective collaborative filtering
(ICDE 2016) and Real-Time Video Recommendation Exploration
Note: Papers can be found via the DBLP
(originally, DBLP meant "Databases and Logic Programming", but by now it covers all topics in
or simply by searching for the paper title with Google (this often yields the PDF directly).
A list of other papers by the same authors can then be found via DBLP.
- Nov/Dec: preparation of case studies and presentations, individual meetings
- January: presentations.
Tentatively, there will be either two presentations per week,
or the seminar will take place in conference style on