Seminar Web Data Integration and Data Management Winter 2018/19

Institute for Informatics
Georg-August-Universität Göttingen

Project Seminar
Web Data Integration and Data Management
Winter 2018/19

Prof. Dr. Wolfgang May	may@informatik.uni-goettingen.de
Lars Runge, M.Sc.,
Sebastian Schrage, M.Sc.

Technical Data

Advanced Bachelor or Master/Diploma in Applied Computer Science or Information Systems (Wirtschaftsinformatik)
Prerequisites: Basic knowledge of, e.g., XML and/or RDF
6 ECTS
Number of participants: max. 10
Language: German and English are both allowed. Reading English texts/documentation is required.

There is a lot of data available in the Web and in the Semantic Web. Web data is usually provided in human-readable form as Web pages (including forms, the so-called Deep Web), but users cannot process it in a database-style way. Data extraction, e.g. from the CIA World Factbook or from Wikipedia, is thus a never-ending "hot topic". In addition to pattern-based approaches, natural language processing approaches are also used.
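As a minimal illustration of the pattern-based approach mentioned above, the following sketch extracts attribute/value pairs from a small HTML fragment with a regular expression. The HTML fragment, its values, and the function name extract_facts are made up for illustration and are not taken from the CIA World Factbook or Wikipedia; real pages are far less regular.

```python
import re

# Hypothetical HTML fragment (invented data, not from a real page).
HTML = """
<table>
<tr><th>Capital</th><td>Example City</td></tr>
<tr><th>Population</th><td>1,000</td></tr>
</table>
"""

# One pattern per "kind" of row: a header cell followed by a data cell.
ROW = re.compile(r"<tr>\s*<th>([^<]+)</th>\s*<td>([^<]+)</td>\s*</tr>")


def extract_facts(html):
    """Return a dict attribute -> value for every table row matching ROW."""
    return {key.strip(): value.strip() for key, value in ROW.findall(html)}


facts = extract_facts(HTML)
print(facts)
```

Such patterns break as soon as the page layout changes, which is one reason why extraction remains an open topic.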

The Semantic Web (cf. lecture Semantic Web) attempts to provide, extend, and/or annotate Web data in a machine-readable form. For this, the RDF data format is used, together with the OWL ontology language for describing metadata.
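To give a rough idea of what "machine-readable" means here: RDF represents data as (subject, predicate, object) triples. The sketch below models triples as plain Python tuples and matches them against a pattern with wildcards. It does not use an RDF library; the ex: names and the helper match are hypothetical.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# The "ex:" names are hypothetical identifiers used only for this sketch.
TRIPLES = [
    ("ex:Goettingen", "rdf:type", "ex:City"),
    ("ex:Goettingen", "ex:locatedIn", "ex:Germany"),
    ("ex:Germany", "rdf:type", "ex:Country"),
]


def match(pattern, triples):
    """Return all triples that match the pattern; None acts as a wildcard."""
    return [
        triple
        for triple in triples
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


# "Which resources are cities?"
cities = match((None, "rdf:type", "ex:City"), TRIPLES)
print(cities)
```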

Form of the Seminar

The intention of the seminar is to get an overview of the state of the art in data integration from the Web and background data management.

For each topic, the following has to be done:

write a tutorial-style paper that gives an overview of an approach,
evaluate some tools and write a report (installation, functionality, usability, ...) [in German or English],
prepare an illustrative medium-size case study using one or more tools (optionally as a comparison),
give a presentation that covers the tutorial and includes a demo of how to use the tools (about 90 minutes incl. discussion; in German or English).

Time Schedule

first meeting at the beginning of the semester:
Monday 22.10. 14h c.t. SR 2.101, IFI: First Meeting
Assignment of topics and papers.
Nov/Dec: Meeting with each of the participants; by that date there must be evidence that the tutorial/presentation/case study will be successful.
Dec/Jan: preparation of case studies and presentations, individual meetings
Jan/Feb: presentations

Potential Topics

Query Languages (mostly for RDF):

Mark Kaminski and Egor V. Kostylev: Beyond Well-designed SPARQL (ICDT 2016). The paper is based on the SPARQL theory-practice papers by Arenas/Gutierrez/Perez that were discussed in the Semantic Web lecture. [Requires SemWeb]
Paolo Guagliardo and Leonid Libkin: Correctness of SQL Queries on Databases with Nulls (SIGMOD Record, Sept. 2017). The paper is about SQL; the talk should evaluate the paper and draw conclusions for SPARQL-style queries (where nulls are common). [Requires SemWeb]
An approach that presents a true algebra over RDF data.
Leonid Libkin, Juan L. Reutter, Domagoj Vrgoc: Trial for RDF: adapting graph query languages for RDF data. PODS 2013: 201-212; and Martin Przyjaciel-Zablocki, Alexander Schätzle, and Georg Lausen: TriAL-QL: Distributed Processing of Navigational Queries (WebDB 2015). [Requires SemWeb]
Jing Wang, Nikos Ntarmos, Peter Triantafillou: GraphCache: A Caching System for Graph Queries (EDBT 2017). [Requires SemWeb]
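For the Guagliardo/Libkin topic in the list above, the following sketch shows a well-known effect of SQL's three-valued logic: two formulations that look equivalent give different answers once a NULL is present. It uses Python's standard sqlite3 module; the tables r and s and their contents are made up, and the example is only an illustration of the problem area, not taken from the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE r(a INTEGER);
    CREATE TABLE s(a INTEGER);
    INSERT INTO r VALUES (1), (2);
    INSERT INTO s VALUES (2), (NULL);
    """
)

# "Values of r that are not in s", written with NOT IN.
# For a = 1 the comparison with NULL is unknown, so the row is dropped.
not_in = conn.execute(
    "SELECT a FROM r WHERE a NOT IN (SELECT a FROM s) ORDER BY a"
).fetchall()

# The same intention, written with NOT EXISTS.
# Here the NULL row in s simply never satisfies s.a = r.a.
not_exists = conn.execute(
    "SELECT a FROM r WHERE NOT EXISTS (SELECT 1 FROM s WHERE s.a = r.a) ORDER BY a"
).fetchall()

print(not_in)      # empty result
print(not_exists)  # contains the row with a = 1
conn.close()
```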

Other Topics:

K. Venkatesh Emani, Karthik Ramachandra, Subhro Bhattacharya, S. Sudarshan
Extracting Equivalent SQL from Imperative Code in Database Applications and DBridge: Translating Imperative Code to SQL
DBridge is a system for optimizing data access in database applications by using static program analysis and program transformations.
Tool might not be available...
SQL Query Equivalence Prover
Shumo Chu, Daniel Li, Chenglong Wang, Alvin Cheung and Dan Suciu
Demonstration of the Cosette Automated SQL Prover and Cosette: An Automated Prover for SQL
These papers present a new formalism and implementation for reasoning about the equivalence of SQL queries.
Explain the overall structure and the introduced formalisms, and try to find the limitations of this approach. We will provide some example queries, which should be extended until the limits of the prover are reached.
The application can be found here.
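To make the notion of query equivalence concrete: two queries are equivalent if they return the same result on every database instance. The sketch below is not Cosette and proves nothing; it merely searches for a counterexample by running two queries on small random instances with Python's sqlite3 module. The table t, the queries, and the helper names run and find_counterexample are hypothetical.

```python
import random
import sqlite3


def run(query, rows):
    """Evaluate the query on a fresh table t(a, b) filled with the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t(a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = sorted(conn.execute(query).fetchall())  # compare as bags
    conn.close()
    return result


def find_counterexample(q1, q2, trials=200, seed=0):
    """Return an instance on which q1 and q2 differ, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(0, 4)
        rows = [(rng.randint(0, 2), rng.randint(0, 2)) for _ in range(size)]
        if run(q1, rows) != run(q2, rows):
            return rows
    return None


Q1 = "SELECT DISTINCT a FROM t WHERE b = 1"
Q2 = "SELECT a FROM t WHERE b = 1 GROUP BY a"
Q3 = "SELECT a FROM t WHERE b = 1"  # keeps duplicates

same = find_counterexample(Q1, Q2)       # no difference found
different = find_counterexample(Q1, Q3)  # an instance with duplicate values
print(same, different)
```

Finding no counterexample is only weak evidence of equivalence; closing this gap with a formal proof is what a prover is for.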

Natural Language Interfaces:

Compare ATHENA and the classical NaLIX
Diptikalyan Saha, Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas, Ashish R. Mittal, Fatma Ozcan
ATHENA: An Ontology Driven System for Natural Language Querying over Relational Data Stores and Constructing a Generic Natural Language Interface for an XML Database
Explain how ATHENA works and show the similarities, the differences, and the limitations of both approaches; in particular, point out the advantages of using XML or SQL.
As far as we know, neither application is freely available, but it cannot hurt to try.
AutoSPARQL
Konrad Höffner, Christina Unger, Lorenz Bühmann, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, Philipp Cimiano
AutoSPARQL Demo Paper (KESW 2013) and AutoSPARQL Paper (ESWC 2011)
This is quite an old tool (2011-2013) for converting natural language queries to SPARQL using machine learning. The project can be found at AutoSPARQL Project, but is no longer maintained.
[Warning: Due to the status of the project, it will take some work to make it run again (if at all)] [Requires SemWeb]
NaLIR
Fei Li, H. V. Jagadish
NaLIR Demo Paper (SIGMOD 2014) and NaLIR Paper (VLDB 2014)
A tool for querying relational databases (MySQL) using natural language, featuring automatic rewriting of query sentences to fit the semantic input space. The project can be found at NaLIR Project.
Recently the tool was used for the "backwards direction", i.e. getting natural language answers explaining the results using the context of the natural language query.
Daniel Deutch, Nave Frost, Amir Gilad
Natural Language Explanations for Query Results (SIGMOD 2018); the tool is not available at the moment.

Entity Matching:

Magellan
Pradap Konda, Sanjib Das, Paul Suganthan, AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeff Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra
Magellan: Toward Building Entity Matching Management Systems (VLDB 2016)
A Python project that "guides" the user to the optimal use of matching techniques. The project can be found at Magellan Project and provides an extensive user manual.
[Requires Python]
DeepMatcher
Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra
Deep Learning for Entity Matching: A Design Space Exploration (SIGMOD 2018)
Again the Magellan project, but this time with deep learning! Needless to say, this topic requires some familiarity with machine learning/deep learning and is probably more difficult to set up. The project can be found at DeepMatcher Project.
[Requires Python]
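Entity matching decides which records from two sources refer to the same real-world entity. The following sketch is independent of Magellan and DeepMatcher and uses none of their APIs; it only compares normalised names with difflib from the Python standard library. The records, the threshold of 0.8, and the helper names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Two hypothetical sources describing the same universities differently.
LEFT = [(1, "Georg-August-Universität Göttingen"), (2, "University of Oxford")]
RIGHT = [("a", "georg august universität göttingen"), ("b", "Universty of Oxford")]


def normalise(name):
    """Lower-case, turn hyphens into spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def similarity(x, y):
    """String similarity in [0, 1] on the normalised names."""
    return SequenceMatcher(None, normalise(x), normalise(y)).ratio()


def match_entities(left, right, threshold=0.8):
    """Return all (left_id, right_id) pairs whose names are similar enough."""
    return [
        (lid, rid)
        for lid, lname in left
        for rid, rname in right
        if similarity(lname, rname) >= threshold
    ]


matches = match_entities(LEFT, RIGHT)
print(matches)
```

Comparing every pair is quadratic in the number of records, and a fixed threshold is a crude decision rule; real systems have to address both points.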

Database Management Systems:

InVerDa
Kai Herrmann, Hannes Voigt, Torben Bach Pedersen, Wolfgang Lehner
Multi-schema-version data management: data independence in the twenty-first century (VLDB 2018)
This DBMS specialises in multi-schema data management. A schema can be evolved in different ways, but the resulting schemata stay connected so that data changes can be propagated between them. The project can be found at InVerDa Project.

Note: Papers can be found via DBLP, http://www.dblp.org (originally, DBLP meant "Databases and Logic Programming", but by now it covers all topics in Computer Science), or simply by searching for the paper title with Google (this often yields the PDF directly). A list of other papers by the same authors can then be found via DBLP. Note that many publishers' pages are only freely accessible from university computers.
