Project Blog

Week 11 - Apr 14, 2021

Minutes

Training the model for 50 iterations had no noticeable effect on test accuracy. Accuracy oscillated in the low 50s (percent), peaking at 53.56%. One possible reason is the quality of the training data: the training set contains questions with improper grammar, and some natural language questions are mapped to the wrong SPARQL templates. A smaller training set of better quality may change the performance of the model. Another possible cause is that the model parameters need further optimization.
Instead of running a local OpenTapioca server, the endpoint was extracted directly from the webpage. It is a POST request that takes the query as part of the form data. It returns the list of recognized entities, the possible tags each could be mapped to, and the best match. The final SPARQL query has not yet been built with these results.
The SPARQL query templates output by the model were combined with the subjects and predicates identified by the Falcon 2.0 service. Because multiple entities were identified, some natural language questions produced more than one candidate query. For some questions, the API identified no relations, resulting in an incomplete SPARQL query. Some generated queries did not order the triples in the WHERE clause correctly, which changed the meaning and output of the natural language question.
One way to assess the correctness of the queries is to take a sample of 50 natural language questions and compare the system's results with human-inferred results.
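
The way several linked entities lead to more than one candidate query can be sketched as follows. This is a minimal illustration only: the `<E>` and `<R>` placeholder syntax, the function name `candidate_queries`, and the identifiers are assumptions made for the example and do not reflect the project's actual template format or the Falcon 2.0 response shape.

```python
from itertools import product

def candidate_queries(template, entities, relations):
    """Fill a template's <E> and <R> slots with every entity/relation pairing.

    Returns an empty list when either slot has no candidates, which mirrors
    the incomplete-query case where no relation was linked.
    """
    if not entities or not relations:
        return []
    return [
        template.replace("<E>", f"wd:{e}").replace("<R>", f"wdt:{r}")
        for e, r in product(entities, relations)
    ]

# Hypothetical template and identifiers, for illustration only.
template = "SELECT ?x WHERE { <E> <R> ?x }"
queries = candidate_queries(template, ["Q1", "Q2"], ["P1"])
for q in queries:
    print(q)
```

With two entities and one relation, the sketch yields two candidates, so a later ranking or filtering step would still be needed to pick one.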

TODO:

Improve the accuracy of the trained model. Clean up the dataset: remove improper questions or correct the grammar wherever possible. Make sure the templates in the training set are correctly classified.
Use the OpenTapioca results with the Falcon 2.0 identifiers to build the final SPARQL queries.
Check the correctness of the system results.

Week 10 - Apr 06, 2021

Minutes

Improved the word embedding features for the training model. Added the child-parent relation, the part-of-speech tag of the word, and the characters in the word. Fixed the loss function, which now decreases steadily after each epoch. After 10 epochs, accuracy is ~65% on the training data and ~55% on the test data.
Connected to the Falcon 2.0 API and extracted the entities and relations for the 5k training and test data. OpenTapioca requires a server set up on the local machine to run.
To check whether a query is correct, one way is to run the query and see whether it returns any results. Another is to compare triple overlap with human-generated queries.
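
The triple-overlap idea can be sketched as a set comparison between the triples of a generated query and those of a human-written one. This is a rough sketch under stated assumptions: triples are already extracted as (subject, predicate, object) tuples, the function name `triple_overlap` and the identifiers are hypothetical, and variable names are compared literally (so `?x` and `?a` would not match even if they play the same role).

```python
def triple_overlap(generated, reference):
    """Precision, recall and F1 over sets of (subject, predicate, object) triples."""
    gen, ref = set(generated), set(reference)
    common = gen & ref
    precision = len(common) / len(gen) if gen else 0.0
    recall = len(common) / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if common else 0.0
    return precision, recall, f1

# Hypothetical triples, for illustration only.
generated = [("wd:Q1", "wdt:P1", "?x"), ("?x", "wdt:P2", "?y")]
reference = [("wd:Q1", "wdt:P1", "?x"), ("?x", "wdt:P3", "?y")]
precision, recall, f1 = triple_overlap(generated, reference)
print(precision, recall, f1)
```

Because the comparison is over whole triples, a triple with subject and object swapped counts as a miss, which is the kind of ordering error this measure would surface.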

TODO:

Train the model for more iterations. Optimize the model parameters.
Experiment with the OpenTapioca setup for entity recognition.
Fill in the slots of the predicted template with the identified RDF triples.
Think of ways to quantify the correctness of the query.

Week 8 - Mar 23, 2021

Minutes

Completed training of the model with dependency parse trees and word embeddings. The resulting accuracy after 5 epochs is 47.78%.
New features to be added: the child-parent relation and the part-of-speech tag of each word.
Identified two APIs for named entity recognition: Falcon 2.0 supports entity and relation linking, while OpenTapioca identifies only entities but with better results.

TODO:

Improve the accuracy of the model: (1) Check that the loss function is unbiased for incorrect classification among templates. (2) Add new features as discussed.
Complete the named entity recognition module to identify the RDF triples for the question.
Fill in the slots of the predicted template with the identified RDF triples.

Week 7 - Mar 16, 2021

Minutes

Debugging of Tree LSTM model. Model failing at tree construction for select sentences.

TODO:

Debug and test the model.
Add relation and part-of-speech features for training.
Search for possible libraries to use for entity and relation linking.

Week 6 - Mar 9, 2021

Minutes

Debugging of Tree LSTM model. Model tested successfully with a subset of five input questions. For a training set of ~6500 questions, the model fails at the pre-processing stage due to indexing issues.

TODO:

Debug and test the model.
Update the proposal.

Week 5 - Mar 2, 2021

Minutes

Tree LSTM model forward pass implemented with the PyTorch library.

TODO:

Train the LSTM with a subset of data including 2 or 3 template types.
Work on building the tree for the LSTM input. Decide on features to be used for training.

Week 4 - Feb 23, 2021

Minutes

Presented the final plan for the project. The final components are question analysis, template classification, named entity recognition, and query building.
Initial processing of the training dataset for the template classification task is done, with some cleaning. Further improvements to come from discarding and merging templates.
POS tagging and dependency tree parsing using the stanza library.
Tree LSTM to be used for identifying the SPARQL query template from the dependency tree.
Query evaluation could follow approaches similar to those used for search engine optimization or for systems that convert questions to SQL queries.

TODO:

Work on writing the code for Tree LSTM.
Train the LSTM with a subset of data including 2 or 3 template types. Evaluate the accuracy of the model built with a few test questions.
Update the proposal and schedule according to the new plan.

Week 3 - Feb 16, 2021

Minutes

No meeting.

TODO:

Create PPT with the high-level plan of the software.
Work on one of the components of the new design.

Week 2 - Feb 09, 2021

Minutes

Discussion on the software design. Plan to figure out the query building component of the software.

TODO:

Create PPT with the high-level architecture plan of the software.
Update the schedule.

Week 1 - Feb 02, 2021

Minutes

Kick-off meeting for the new semester.
CS 298 proposal review and discussion.

TODO:

Complete CS 298 proposal with two committee members.
Get add code for CS 298

Week 13 - Nov 24, 2020

Minutes

Deliverable 4 demo: the program reads the Wikidata dump as JSON objects and loads each resource name and ID into a map data structure. Every token in the input sentence is looked up in the map for the corresponding index. Possible multi-word resources are searched for iteratively.
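
The token lookup with iterative multi-word search could look roughly like the sketch below. It is not the deliverable's code: the greedy longest-match strategy, the function name `find_resources`, the `max_len` limit, and the placeholder identifiers are all assumptions chosen for illustration.

```python
def find_resources(tokens, label_to_id, max_len=4):
    """Greedy longest-match lookup of token spans in a label -> id map."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest span first, then shrink it one token at a time.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in label_to_id:
                matches.append((phrase, label_to_id[phrase]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no span starting at this token matched

    return matches

# Placeholder ids, not real Wikidata identifiers.
label_to_id = {"new york city": "Q0001", "mayor": "Q0002"}
tokens = "who is the mayor of new york city".split()
matches = find_resources(tokens, label_to_id)
print(matches)
```

Trying longer spans first keeps "new york city" from being split into shorter, unrelated matches.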

TODO:

Try connecting the POS tagging from deliverable 3 to deliverable 4 to correctly identify items and properties.
First draft of report due.

Week 12 - Nov 17, 2020

Minutes

Reworked the schedule for the remainder of the semester.
CS297 report first draft due on 1st Dec. Final submission of the report on 8th Dec.
Deliverable 4 scope discussed: read through the Wikidata database dump for all items and properties. Search for the identified tokens in the input sentence. Maintain a Bloom filter for easier search [possible prefix search of word* in the Bloom filter]. Fetch the relevant identifier to be used in the SPARQL query.
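
A Bloom filter with a simple form of prefix search can be sketched as below. This is an illustrative sketch, not the deliverable's design: the class, the salted SHA-256 hashing scheme, the filter size, and the convention of storing each prefix with a trailing `*` are all assumptions. A Bloom filter can report false positives but never false negatives, so a hit still has to be confirmed against the real map.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def add_with_prefixes(bloom, label, min_prefix=3):
    """Store the label plus each of its prefixes marked with a trailing '*'."""
    bloom.add(label)
    for end in range(min_prefix, len(label) + 1):
        bloom.add(label[:end] + "*")

bloom = BloomFilter()
add_with_prefixes(bloom, "douglas adams")
print("douglas adams" in bloom, "doug*" in bloom)
```

Storing prefixes multiplies the number of inserted keys, so the filter size would need to grow accordingly for a full dump.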

TODO:

Convert PDF pages to HTML
Upload deliverable 3 report as HTML
Work on deliverable 4

Week 11 - Nov 10, 2020

Minutes

Demo of the encoder-decoder model. Implemented with a limited dataset due to unavailability of data. Results showed potential for training on larger datasets with longer sentences.
The Deliverable 4 objective is to use the Wikidata API or Wikipedia page titles to map entities to items/articles. For questions, the model needs to map the question type to the SELECT clause and construct the other parts of the SPARQL query from the verb and subject/object.

TODO:

Upload deliverable 3 report.
Work on architecture for deliverable 4.

Week 11 - Nov 03, 2020

Minutes

Preliminary implementation of entity tagging: identify the auxiliary part of speech using Penn Treebank tags and combine it with the position of the word to classify the word as subject, verb, or object.
A better approach is sequence-to-sequence learning: feed every word of the input sentence to the encoder LSTM during training, and use start and stop codes on the training target values.
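
The use of start and stop codes on the target values can be illustrated with a small data-preparation sketch. The token strings `<s>` and `</s>`, the tag names, and the helper `make_decoder_pair` are assumptions for the example; the sketch shows the usual teacher-forcing arrangement, where the decoder input is the target shifted right by one step.

```python
START, STOP = "<s>", "</s>"

def make_decoder_pair(tags):
    """Build decoder input and target sequences from a tag sequence."""
    decoder_input = [START] + tags   # what the decoder sees at each step
    decoder_target = tags + [STOP]   # what it should predict at each step
    return decoder_input, decoder_target

# Hypothetical sentence and tags, for illustration only.
words = ["Alice", "reads", "books"]
tags = ["SUBJ", "VERB", "OBJ"]
decoder_input, decoder_target = make_decoder_pair(tags)
print(decoder_input)
print(decoder_target)
```

At inference time the decoder would be seeded with the start code and run until it emits the stop code.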

TODO:

Implement the encoder-decoder model for direct tagging as subject, verb, and object.

Week 10 - Oct 27, 2020

Minutes

Demo of the word representation module using pre-trained GloVe embeddings, character embeddings, and single-layer GRU networks.
The next deliverable is to create a neural network to identify the subject, object, and verb of the sentence.
Applying seq2seq learning to translate natural language to SPARQL.
New reference - PAROT: Translating natural language to SPARQL https://www.sciencedirect.com/science/article/pii/S2590188520300032

TODO:

Upload report for word representation module
Neural network for POS tagging
Read chapter 5 on Sequence-to-Sequence Learning from Charniak's Introduction to Deep Learning.

Week 9 - Oct 20, 2020

Minutes

Improve the character embeddings by using a window before and after the target character instead of looking only at the preceding characters.
Use pre-trained character embeddings if above does not work.
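
The two-sided window can be sketched as a function that pairs each target character with its left and right context. The window size, the `_` padding character, and the function name `char_windows` are assumptions for illustration; how the contexts are then fed to the embedding model is not shown.

```python
def char_windows(word, window=2, pad="_"):
    """Return (left_context, target, right_context) for every character."""
    padded = pad * window + word + pad * window
    examples = []
    for i, ch in enumerate(word):
        center = i + window  # index of the target character in the padded string
        left = padded[center - window:center]
        right = padded[center + 1:center + 1 + window]
        examples.append((left, ch, right))
    return examples

examples = char_windows("cat", window=2)
for left, target, right in examples:
    print(left, target, right)
```

Padding both ends keeps every context the same length, including for the first and last characters of the word.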

TODO:

Complete the word representation model

Week 8 - Oct 13, 2020

No meeting

Week 7 - Oct 6, 2020

Minutes

Ways to design neural net model for creating character embeddings
Creating SPARQL query clauses directly from the question instead of searching for relevant answers

TODO:

Build on implementation of character level embeddings

Week 6 - Sep 29, 2020

Minutes

Discussion on the neural net model designed with word and character embeddings

TODO:

Work on implementation of the word representation model with Freebase dataset.
Create PPT for the next model discussion

Week 5 - Sep 22, 2020

Minutes

Presentation on the reading for the week. Reference [2]

TODO:

Update and upload the PPTs as PDF
Update Deliverable 1 page. Add more information on the queries and a snippet of the output.
Decide on 3 specific QA systems to study related to linked data and SPARQL queries.
Explore algorithms for the Data Matching and the Scoring steps.

Week 4 - Sep 15, 2020

Minutes

Completed Deliverable 1. Presented the materials attached with the deliverable 1 link

TODO:

Upload the PPT and SPARQL queries onto the webpage
Read the paper [2] on types of Question Answering systems

Week 3 - Sep 08, 2020

Minutes

Knowledge representation systems and description logic languages for ontologies. Prolog computer engines as a precursor to RDF systems.
Suggested Reading for RDF: Semantic Web for the Working Ontologist

TODO:

Short presentation on SPARQL and RDF
Five complex queries in SPARQL

Week 2 - Sep 01, 2020

Minutes

Discussed the feasibility of finding trends in language evolution - especially with usage of words from another language. Concluded that with little information present and the added complexity of detecting inspiration for words from multiple languages the idea should be rejected. A possibly doable task could be to find the era from historical texts.
Discussed question answering systems and index creation using subject-predicate-object triples. One application is Wikidata, which can be queried using the SPARQL language. Finalized translating natural language questions to SPARQL queries as the project.

TODO:

Complete CS297 proposal
Watch the NOVA episode on Watson - The Smartest Machine on Earth
Work towards deliverable #1

Week 1 - Aug 25, 2020

Minutes

Initial topic discussion related to NLP and web crawlers
Projects suggested :
- Ranking results from multiple sources in a web crawler
- A self-contained CLI language translator
- Using word embeddings on historical text to determine the evolution of the English language - like the Great Vowel Shift

TODO:

Write CS297 proposal