CS297 Proposal
Word Sense Determination From Wikipedia Data Using A Neural Net
Qiao Liu (nicole.liuqiao@gmail.com)
Advisor: Dr. Chris Pollett
Description:
Many words have multiple meanings. For example, plant can mean a type of living organism or a factory. Being able to determine the sense of such words is very useful in natural language processing tasks, such as speech synthesis, question answering, and machine translation.
As part of the project, we will use a modular model to classify the sense of words to be disambiguated. This model consists of two parts. The first part is a neural-network-based language model that computes continuous vector representations of words from data sets created from Wikipedia pages. The second part classifies the meaning of a given word without explicitly knowing what the meaning is.
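The two-part design can be illustrated with a minimal sketch. It assumes that each occurrence of an ambiguous word is represented by the average of its context words' vectors, and that occurrences are then grouped by a plain k-means clustering. The toy vectors, the helper names (`context_vector`, `k_means`), and the choice of k-means are assumptions made for illustration only; the proposal does not commit to a specific embedding model or classifier.

```python
import math

# Hypothetical 2-d word vectors, hand-picked for illustration only.
# In the proposed project these would come from a neural language
# model trained on Wikipedia text (part one of the model).
TOY_VECTORS = {
    "leaf": (1.0, 0.1), "water": (0.9, 0.2), "grow": (0.8, 0.0),
    "worker": (0.1, 1.0), "steel": (0.0, 0.9), "build": (0.2, 0.8),
}


def context_vector(words):
    """Average the vectors of the context words found in TOY_VECTORS."""
    known = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not known:
        raise ValueError("no known context words")
    dims = len(known[0])
    return tuple(sum(v[d] for v in known) / len(known) for d in range(dims))


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids, so the
    result is deterministic. Returns one cluster label per point."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                dims = len(members[0])
                centroids[c] = tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dims)
                )
    return labels


# Context words around four hypothetical occurrences of "plant".
contexts = [
    ["water", "leaf"],    # organism sense
    ["worker", "steel"],  # factory sense
    ["grow", "leaf"],     # organism sense
    ["build", "steel"],   # factory sense
]
points = [context_vector(c) for c in contexts]
labels = k_means(points, k=2)  # part two: group occurrences by sense
print(labels)  # → [0, 1, 0, 1]
```

The cluster labels carry no names: the sketch separates the occurrences into two groups without being told that one group is "organism" and the other "factory", which is the sense in which the second part works without explicitly knowing the meaning.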
Schedule:
Week 1:
Feb. 14 - Feb. 21 | Discuss the topic. Download the Wikipedia dataset and understand the format of disambiguation pages.
Week 2:
Feb. 21 - Feb. 28 | Take Coursera Machine Learning course (Weeks 1 & 2). Install TensorFlow.
Week 3:
Feb. 28 - Mar. 7 | Take Coursera Machine Learning course (Weeks 3 & 4). Understand basic TensorFlow logic. Study Python.
Week 4:
Mar. 7 - Mar. 14 | Deliverable #1: An example program to run in TensorFlow. Study Python. Coursera Machine Learning course (Week 5).
Week 5:
Mar. 14 - Mar. 21 | Coursera Machine Learning course (Weeks 6 & 10). Literature review on neural network language models.
Week 6:
Mar. 21 - Apr. 4 | Study TensorFlow. Literature review on word embedding.
Week 7:
Apr. 4 - Apr. 11 | Deliverable #2: Presentation on word embedding. Literature review on neural network language models and word embedding.
Week 8:
Apr. 11 - Apr. 18 | Extract data from Wikipedia and preprocess it on a small scale.
Week 9:
Apr. 18 - Apr. 25 | Extract data from Wikipedia.
Week 10:
Apr. 25 - May 2 | Deliverable #3: Data preprocessing program.
Week 11:
May 2 - May 9 | Start working on the CS297 Final Report.
Week 12:
May 9 - May 16 | Deliverable #4: Complete the CS297 Final Report.
Deliverables:
The full project will be done when CS298 is completed. The following will
be done by the end of CS297:
1. An example program to run in TensorFlow
2. Presentation on word embedding
3. Data preprocessing program
4. CS297 Final Report: This is the culminating document for this semester's activities.
It will include:
4.1 An overview of the project problem
4.2 A summary of approaches to the problem
4.3 A discussion of the platform I plan to use
4.4 A discussion of the technology I plan to experiment with
References:
Christopher Olah, "Deep Learning, NLP, and Representations", http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Silviu Cucerzan, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data", 2007
Rada Mihalcea, "Using Wikipedia for Automatic Word Sense Disambiguation", 2007