CS298 Proposal

Translating Natural Language to SPARQL Query Language for an RDF-Based Question Answering System

Shreya Satish Bhajikhaye (shreyasatish.bhajikhaye@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun, Dr. Katerina Potika

Abstract:

RDF (Resource Description Framework) represents linked data as a graph of subject-predicate-object triples. The Semantic Web uses this framework to connect information on the web. The SPARQL language was created to query data in an RDF model. Considerable research has gone into making RDF data easier to search and query. This project focuses on creating an RDF-based question answering system. The input to the system is a question in English, which is translated into the equivalent SPARQL query. After reading a complete sentence, the software identifies the entities of the RDF triple and builds the corresponding query clauses.
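As an illustration of the intended input and output, the sketch below pairs one English question with a SPARQL query over Wikidata. Q42 (Douglas Adams) and P19 (place of birth) are real Wikidata identifiers. The `build_query` helper and the hard-coded lookup tables are hypothetical stand-ins for the entity identification step, not the project's actual design.

```python
# Hypothetical lookup tables standing in for the entity identification step.
ITEM_IDS = {"douglas adams": "Q42"}   # Wikidata item: Douglas Adams
PROPERTY_IDS = {"born": "P19"}        # Wikidata property: place of birth


def build_query(subject: str, predicate: str) -> str:
    """Build a one-triple SPARQL query asking for the object of (subject, predicate)."""
    item = ITEM_IDS[subject.lower()]
    prop = PROPERTY_IDS[predicate.lower()]
    # wd: and wdt: are the item and direct-property prefixes used by Wikidata.
    return f"SELECT ?answer WHERE {{ wd:{item} wdt:{prop} ?answer . }}"


# "Where was Douglas Adams born?" -> subject "Douglas Adams", predicate "born"
query = build_query("Douglas Adams", "born")
print(query)
```

The single triple pattern in the WHERE clause mirrors the subject-predicate-object structure of RDF: the subject and predicate are known, and the object is the variable to be returned.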

CS297 Results

Learned to construct SPARQL queries and search the Wikidata database.
Studied different question answering systems and the techniques used in them. Implemented a character-level embedding technique using GloVe embeddings and a single-layer GRU model.
Implemented an entity (subject, object, or verb) identifier on a small dataset of sentences using a sequence-to-sequence learning model.
Created a program to find the relevant entities in the Wikidata dump.
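A minimal sketch of how entities might be looked up in a Wikidata JSON dump is given below. It assumes the dump layout of one entity per line inside a JSON array, with `labels` and `aliases` keyed by language code. The function names and the tiny in-memory sample (including its alias values) are illustrative assumptions, not the program written for CS 297.

```python
import json
from typing import Iterable, Iterator, List


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a Wikidata JSON dump.

    Assumes one entity per line inside a JSON array, so the bracket
    lines are skipped and the trailing comma on each line is removed.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        yield json.loads(line)


def find_by_label(lines: Iterable[str], term: str, lang: str = "en") -> List[str]:
    """Return ids of entities whose label or alias equals `term` (case-insensitive)."""
    wanted = term.lower()
    matches = []
    for entity in iter_entities(lines):
        names = []
        label = entity.get("labels", {}).get(lang)
        if label:
            names.append(label["value"])
        names.extend(a["value"] for a in entity.get("aliases", {}).get(lang, []))
        if any(name.lower() == wanted for name in names):
            matches.append(entity["id"])
    return matches


# Tiny in-memory stand-in for a dump file (a real dump is far larger).
sample_dump = [
    "[",
    '{"id": "Q42", "labels": {"en": {"language": "en", "value": "Douglas Adams"}}, '
    '"aliases": {"en": [{"language": "en", "value": "Douglas Noel Adams"}]}},',
    '{"id": "P19", "labels": {"en": {"language": "en", "value": "place of birth"}}, '
    '"aliases": {"en": [{"language": "en", "value": "birthplace"}]}}',
    "]",
]

print(find_by_label(sample_dump, "birthplace"))
```

Reading the dump line by line keeps memory use flat, which matters because the full dump does not fit comfortably in memory as a single parsed JSON array.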

Proposed Schedule

Week 1: Feb 2 - Feb 8	Review proposal for CS 298
Week 2: Feb 9 - Feb 15	Build the software design. Study different methods to construct the query after recognizing the triples.
Week 3 - Week 4: Feb 16 - Feb 29	Identify training and testing datasets for the model. Preprocess questions to create a clean dataset. Build the features on which the model is to be trained. Features include the dependency parse tree, the relation between two words in a question, the part of speech of each word, and the word embedding.
Week 5 - Week 7: Mar 2 - Mar 22	Write the code for the Tree-LSTM model from scratch. Combine the data processing and model training steps. Test and debug the model so that it classifies the input question into a predefined template.
Week 8 - Week 10: Mar 23 - Apr 12	Implement a resource identifier module with word sense disambiguation for better results. Construct the query from templates based on question type, using question and resource information. Validate and quantify the correctness of results using the Wikidata RDF query portal.
Week 11 - Week 15: Apr 13 - May 17	Complete final CS 298 Report and presentation
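The dependency parse tree feature planned for Weeks 3-4 feeds directly into the Tree-LSTM work of Weeks 5-7, since a Tree-LSTM computes a node's state from the states of its children. The sketch below shows one possible way to turn head indices into a tree and derive a children-before-parent evaluation order. The parse is hand-written for illustration (a dependency parser would normally produce it), and the function names are assumptions.

```python
from typing import Dict, List

# Hand-written dependency parse for "Who wrote the book"
# (an assumption for illustration; a parser would normally produce this).
# Each token: (word, part-of-speech tag, index of head token; -1 marks the root).
tokens = [
    ("Who", "PRON", 1),
    ("wrote", "VERB", -1),
    ("the", "DET", 3),
    ("book", "NOUN", 1),
]


def build_children(heads: List[int]) -> Dict[int, List[int]]:
    """Map each token index to the indices of its dependents."""
    children: Dict[int, List[int]] = {i: [] for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            children[head].append(child)
    return children


def post_order(root: int, children: Dict[int, List[int]]) -> List[int]:
    """Visit dependents before their head, so child states exist before the parent's."""
    order: List[int] = []
    for child in children[root]:
        order.extend(post_order(child, children))
    order.append(root)
    return order


heads = [head for _, _, head in tokens]
root = heads.index(-1)
children = build_children(heads)
order = post_order(root, children)
print([tokens[i][0] for i in order])
```

In this ordering the root verb is visited last, so its hidden state summarizes the whole question and could serve as the input to the template classifier.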

Key Deliverables:

Software

Implement a classification model using Tree-LSTM that classifies the input question into a predefined SPARQL query template.
Create a search framework that correctly finds the relevant item/property identifier for the RDF triples in the Wikidata dump. Match multi-word entities in the triple to items in Wikidata.
Implement a SPARQL query formation module that uses the item/property identifiers of the RDF triples to build the SELECT and WHERE clauses of the SPARQL query.
A functional system implemented in Python that takes an English sentence as input and returns the equivalent SPARQL query, which can be run against the Wikidata database.
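To make the query formation deliverable concrete, the sketch below fills a predefined template with item and property identifiers to produce the SELECT and WHERE clauses. The template inventory, the template names, and `form_query` are hypothetical; the actual set of templates is to be decided during the project. P50 is the Wikidata property for author.

```python
import re
from typing import Dict

# Hypothetical template inventory; the classifier is assumed to output one of these keys.
TEMPLATES: Dict[str, str] = {
    "object_lookup": "SELECT ?answer WHERE {{ wd:{item} wdt:{prop} ?answer . }}",
    "subject_lookup": "SELECT ?answer WHERE {{ ?answer wdt:{prop} wd:{item} . }}",
    "count_objects": (
        "SELECT (COUNT(?answer) AS ?count) "
        "WHERE {{ wd:{item} wdt:{prop} ?answer . }}"
    ),
}

ITEM_RE = re.compile(r"Q\d+")   # Wikidata item identifiers, e.g. Q42
PROP_RE = re.compile(r"P\d+")   # Wikidata property identifiers, e.g. P19


def form_query(template_id: str, item: str, prop: str) -> str:
    """Fill the chosen template with validated item and property identifiers."""
    if not ITEM_RE.fullmatch(item) or not PROP_RE.fullmatch(prop):
        raise ValueError(f"not a Wikidata identifier pair: {item!r}, {prop!r}")
    return TEMPLATES[template_id].format(item=item, prop=prop)


# "Which works were written by Douglas Adams?" -> subject is unknown, object is Q42.
print(form_query("subject_lookup", "Q42", "P50"))
```

Validating the identifiers before substitution keeps unresolved surface text, such as a raw entity name, from being inserted into the query.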

Report
- CS 298 Report
- CS 298 Presentation

Innovations and Challenges

Tree-LSTM is a recently developed model that processes a sentence along its tree structure, capturing the relations between its words. Because it is recent, few reference resources and little prior research are available.
Identifying the correct resource in Wikidata for the terms in the question is challenging. A single term can consist of multiple words, or it can be abbreviated in different ways in the question. A predicate term can be described differently in the Wikidata properties, so synonyms must be considered too. Word sense ambiguity is another challenge, since choosing the wrong sense can alter the intent of the question.
Constructing a structured query with multiple statements is complex because the relations between multiple resources must be explored to achieve the intended result. The same question can be written as a query in several ways, and each can produce different results.
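One possible approach to the multi-word and synonym problem described above is greedy longest-match lookup against an index of labels and aliases. The sketch below is only an illustration under stated assumptions: `LABEL_INDEX` is a hypothetical, hand-built index, and `match_terms` and its `max_len` parameter are not part of any existing library. It does not address word sense disambiguation.

```python
from typing import Dict, List, Tuple

# Hypothetical label index: lowercase label or alias -> Wikidata identifier.
LABEL_INDEX: Dict[str, str] = {
    "douglas adams": "Q42",
    "place of birth": "P19",
    "birthplace": "P19",   # synonym mapped to the same property
}


def match_terms(
    question: str, index: Dict[str, str], max_len: int = 4
) -> List[Tuple[str, str]]:
    """Greedy longest-match: at each position, try the longest word span first."""
    words = [w.strip("?.,!").lower() for w in question.split()]
    words = [w for w in words if w]
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            span = " ".join(words[i:i + n])
            if span in index:
                found.append((span, index[span]))
                i += n          # skip past the matched span
                break
        else:
            i += 1              # no span starting here matched
    return found


print(match_terms("What is the place of birth of Douglas Adams?", LABEL_INDEX))
```

Trying the longest span first keeps "place of birth" from being broken into single words, and storing aliases in the same index lets "birthplace" resolve to the same property.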

References:

[1] Hamid Zafar, Giulio Napolitano, and Jens Lehmann. Formal query generation for question answering over knowledge bases. In The Semantic Web, pages 714-728, Cham, 2018. Springer International Publishing.

[2] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. CoRR, abs/1503.00075, 2015.

[3] Christina Unger, Lorenz Bühmann, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, and Philipp Cimiano. Template-based question answering over RDF data. In WWW'12: Proceedings of the 21st Annual Conference on World Wide Web, April 2012.

[4] Dennis Diefenbach, Vanessa Lopez, Kamal Singh, and Pierre Maret. Core techniques of question answering systems over knowledge bases: a survey. Knowledge and Information Systems, 55(3):529-569, Jun 2018.

[5] Bob DuCharme. Learning SPARQL. O'Reilly, 2nd edition, 2013.