CS297 Proposal

Word Sense Determination From Wikipedia Data Using A Neural Net

Qiao Liu (nicole.liuqiao@gmail.com)

Advisor: Dr. Chris Pollett

Description:

Many words have multiple meanings. For example, "plant" can mean a type of living organism or a factory. Being able to determine the sense of such words is very useful in natural language processing tasks such as speech synthesis, question answering, and machine translation. As part of this project, we will use a modular model to classify the sense of words to be disambiguated. The model consists of two parts. The first part is a neural-network-based language model that computes continuous vector representations of words from data sets created from Wikipedia pages. The second part classifies the meaning of a given word without explicitly knowing what that meaning is.
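
To make the two-part design concrete, the following is a minimal sketch rather than the planned implementation. It assumes the second part can be approximated by clustering: each occurrence of "plant" is represented by the average vector of its context words, and a plain k-means groups the occurrences into unnamed senses. The hand-made 2-d vectors in TOY_VECTORS, the helper names, and the choice of k-means are all assumptions for illustration; in the project the vectors would come from the neural language model.

```python
import math

# Hypothetical 2-d word vectors standing in for embeddings that a
# neural language model would learn from Wikipedia text.
TOY_VECTORS = {
    "leaf": (0.9, 0.1), "water": (0.8, 0.2), "grow": (1.0, 0.0),
    "worker": (0.1, 0.9), "steel": (0.0, 1.0), "factory": (0.2, 0.8),
}

def context_vector(context_words, vectors):
    """Average the vectors of the known words around the target word."""
    known = [vectors[w] for w in context_words if w in vectors]
    if not known:
        raise ValueError("no known context words")
    dim = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dim))

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10):
    """Plain k-means, seeded with the first k points to stay deterministic."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: distance(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                centroids[c] = tuple(sum(m[i] for m in members) / len(members)
                                     for i in range(dim))
    return labels

# Context words for four made-up occurrences of the word "plant".
contexts = [
    ["water", "leaf"],
    ["steel", "worker"],
    ["grow", "water"],
    ["factory", "worker"],
]
points = [context_vector(c, TOY_VECTORS) for c in contexts]
labels = kmeans(points, k=2)
print(labels)
```

The cluster labels are arbitrary integers, which matches the idea of separating senses without naming them: occurrences that share a label are treated as sharing a sense.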

Schedule:

Week 1: Feb. 14 - Feb. 21	Discuss topics. Download the Wikipedia dataset and understand the format of disambiguation pages.
Week 2: Feb. 21 - Feb. 28	Take Coursera Machine Learning course (Weeks 1 & 2). Install TensorFlow.
Week 3: Feb. 28 - Mar. 7	Take Coursera Machine Learning course (Weeks 3 & 4). Understand basic TensorFlow logic. Study Python.
Week 4: Mar. 7 - Mar. 14	Deliverable #1: An example program to run in TensorFlow. Study Python. Coursera Machine Learning course (Week 5).
Week 5: Mar. 14 - Mar. 21	Coursera Machine Learning course (Weeks 6 & 10). Literature review on neural network language models.
Week 6: Mar. 21 - Apr. 4	Study TensorFlow. Literature review on word embedding.
Week 7: Apr. 4 - Apr. 11	Deliverable #2: Presentation on word embedding. Literature review on neural network language models and word embedding.
Week 8: Apr. 11 - Apr. 18	Extract data from Wikipedia and preprocess it on a small scale.
Week 9: Apr. 18 - Apr. 25	Extract data from Wikipedia.
Week 10: Apr. 25 - May 2	Deliverable #3: Data preprocessing program.
Week 11: May 2 - May 9	Start working on the CS297 Final Report.
Week 12: May 9 - May 16	Deliverable #4: Complete the CS297 Final Report
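
As an illustration of the extraction and preprocessing steps planned for Weeks 8 to 10, the sketch below pulls internal links out of raw wikitext. It relies on the standard wikitext link forms [[Target]] and [[Target|label]]; treating the link target as a sense label for the linked word is an assumption about how the data might be used, and the function name and the sample sentence (including its article titles) are hypothetical.

```python
import re

# Matches [[Target]] and [[Target|label]] wikitext links (no nesting).
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]*))?\]\]")

def extract_links(wikitext):
    """Return (surface_text, target_title) pairs for each internal link."""
    pairs = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        label = match.group(2)
        # Without a label, the link is displayed as its target title.
        surface = label.strip() if label else target
        pairs.append((surface, target))
    return pairs

# Made-up sentence; the article titles are for illustration only.
sample = "The [[Plant (factory)|plant]] closed, so a [[plant]] grew there."
links = extract_links(sample)
print(links)
```

Each pair records the word as it appears in the text together with the article it links to, so two occurrences of the same surface word can carry different targets.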

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. An example program to run in TensorFlow

2. Presentation on word embedding

3. Data preprocessing program

4. CS297 Final Report: This is the culminating document for this semester's activities. It will include:

4.1 An overview of the project problem

4.2 A summary of approaches to the problem

4.3 A discussion of the platform I plan to use

4.4 A discussion of the technology I plan to experiment with

References:

Christopher Olah, "Deep Learning, NLP, and Representations", http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

Silviu Cucerzan, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data", 2007

Rada Mihalcea, "Using Wikipedia for Automatic Word Sense Disambiguation", 2007