CS298 Proposal

Retrieval of Related Entities from Wikipedia Data with Neural Network

Qiao Liu (nicole.liuqiao@gmail.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Jon Pearce, Dr. Suneuy Kim

Abstract:

People searching for knowledge on the internet often find that their query is only a poor look-up key for the knowledge they actually want. Serving topics related to the search query can help close this relevance gap. In this project, we propose to build a Related Entity Recommender that finds related entities based on word-entity embeddings learned by a neural network from Wikipedia titles, Wikipedia disambiguation pages, and Wikipedia page data. We hope that, given a query such as "Acupuncture", our system will be able to produce related entities such as "Homeopathy", "Osteopathic", and "Pain Management". As part of this project, we will try and compare different variants of the context model used to train the initial neural network word embedding, such as Skip-gram and CBOW.
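To make the difference between the two variants concrete, the following is a minimal sketch of how training examples could be generated from a token sequence under each scheme. The function names, the window size, and the sample sentence are assumptions made for illustration only; they are not the project's implementation.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: one (center, context) pair per neighbor inside the window.

    The model is trained to predict each context token from the center token.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: one (context_tokens, center) example per position.

    The model is trained to predict the center token from its whole context.
    """
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip positions that have no neighbors at all
            examples.append((context, center))
    return examples


if __name__ == "__main__":
    # Hypothetical sample sentence, already tokenized and lowercased.
    sample = ["acupuncture", "is", "a", "form", "of", "alternative", "medicine"]
    for pair in skipgram_pairs(sample, window=1):
        print("skip-gram:", pair)
    for example in cbow_examples(sample, window=1):
        print("cbow:", example)
```

Skip-gram produces many small examples per position, while CBOW produces one example per position with a bag of context tokens, which is one reason the two are worth comparing on the same data.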

CS297 Results

  • Implemented a program to extract pages from a Wikipedia data dump
  • Implemented a machine learning program in TensorFlow
  • Reviewed the reference papers

Proposed Schedule

Weeks 1, 2 (Aug. 29 - Sep. 11): Build a program that extracts entities from Wikipedia.
Week 3 (Sep. 12 - Sep. 18): Deliverable 1 due: A program that generates training data by extracting entity relationships from Wikipedia.
Week 4 (Sep. 19 - Sep. 25): Build a Skip-gram model using a neural network to learn entity embeddings.
Weeks 5, 6 (Sep. 26 - Oct. 9): Deliverable 2 due: Entity embeddings learned by a neural network with the Skip-gram model.
Week 7 (Oct. 10 - Oct. 16): Build a program that finds related entities by calculating similarity on top of the entity embeddings.
Week 8 (Oct. 17 - Oct. 23): Deliverable 3 due: A program that finds related entities with the Skip-gram model.
Weeks 9, 10 (Oct. 24 - Nov. 6): Deliverable 4 due: A program that finds related entities with the CBOW model.
Week 11 (Nov. 7 - Nov. 13): Build a program that uses another algorithm (TBD) to learn entity embeddings.
Week 12 (Nov. 14 - Nov. 20): Deliverable 5 due: A program that finds related entities using another model (TBD).
Week 13 (Nov. 21 - Nov. 27): Analyze and integrate all programs to support related entity search.
Weeks 14, 15 (Nov. 28 - Dec. 11): Deliverable 6: Report
Week 16 (Dec. 12 - Dec. 18): Deliverable 6: Presentation

Key Deliverables:

  • Software
    • Program to generate training data by extracting entity relationships from Wikipedia pages.
    • Program to learn entity embeddings with a neural network, using the Skip-gram and CBOW models.
    • Program to find related entities on top of the entity embeddings.
    • Program that serves as a simple interface where people type in a keyword and get related topics.
  • Report
    • CS298 Report
    • CS298 Presentation
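
The deliverable that finds related entities on top of the embeddings can be pictured as a nearest-neighbor lookup. The sketch below assumes cosine similarity as the measure and uses made-up three-dimensional vectors; the proposal does not fix a similarity function, and real embeddings would come from the trained model.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_entities(
    query: str, embeddings: Dict[str, List[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k entities most similar to the query, best match first.

    An unknown query yields an empty list rather than an error.
    """
    if query not in embeddings:
        return []
    query_vec = embeddings[query]
    scored = [
        (name, cosine(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors invented for illustration; they are not learned embeddings.
TOY_EMBEDDINGS = {
    "Acupuncture": [0.9, 0.1, 0.0],
    "Homeopathy": [0.8, 0.2, 0.1],
    "Pain Management": [0.7, 0.3, 0.0],
    "Compiler": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for name, score in related_entities("Acupuncture", TOY_EMBEDDINGS, k=2):
        print(f"{name}: {score:.3f}")
```

A linear scan like this is enough for small vocabularies; for the full set of Wikipedia entities an approximate nearest-neighbor index would likely be needed.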

Innovations and Challenges

  • A new approach to finding related entities using embeddings and a neural network.
  • We know that entities appearing on the same page are related, but how to represent that relationship in the training data is a challenge.
  • There are papers on word embedding and sentence embedding, but very few papers on entity embedding. Researching and implementing entity embedding is therefore a challenge.
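
One possible way to represent the same-page relationship, offered only as a sketch, is to treat the wiki links on a page as its entities and emit a pair for every page-to-link and link-to-link co-occurrence. The regular expression, the rule for skipping namespaced links such as "File:", and the sample wikitext are all assumptions for illustration; the proposal leaves the actual representation open.

```python
import re
from itertools import combinations
from typing import List, Tuple

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 is the target page title.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def linked_entities(wikitext: str) -> List[str]:
    """Return link targets in order of first appearance, without duplicates."""
    entities: List[str] = []
    for match in LINK_RE.finditer(wikitext):
        title = match.group(1).strip()
        # Assumption: titles containing ':' are namespaced (File:, Category:) and skipped.
        if title and ":" not in title and title not in entities:
            entities.append(title)
    return entities


def training_pairs(page_title: str, wikitext: str) -> List[Tuple[str, str]]:
    """Pair the page with each linked entity, then pair linked entities with each other."""
    entities = [e for e in linked_entities(wikitext) if e != page_title]
    pairs = [(page_title, entity) for entity in entities]
    pairs.extend(combinations(entities, 2))  # each unordered pair appears once
    return pairs


if __name__ == "__main__":
    # Hypothetical wikitext fragment, not taken from a real dump.
    sample = (
        "'''Acupuncture''' is often discussed alongside [[Homeopathy]] and "
        "[[Pain management|pain relief]]. [[File:Needle.jpg|thumb]]"
    )
    for pair in training_pairs("Acupuncture", sample):
        print(pair)
```

Pairs produced this way could be fed to a Skip-gram style model in place of (center, context) word pairs, although treating every co-occurrence as equally strong is a simplification.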

References:

Christopher Olah, "Deep Learning, NLP, and Representations", http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

Sanjeev Arora, Yingyu Liang, Tengyu Ma, "A Simple but Tough-to-Beat Baseline for Sentence Embeddings", ICLR, 2017

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", arXiv:1301.3781, 2013

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, Christian Jauvin, "A Neural Probabilistic Language Model", Journal of Machine Learning Research 3, 2003