CS297 Proposal
Compare word2vec with hash2vec for Word Sense Disambiguation on Wikipedia corpus
Neha Gaikwad (neha.gaikwad@sjsu.edu)
Advisor: Dr. Chris Pollett
Description:
When a single word has multiple meanings, it is difficult for a machine to interpret a query containing that word. For example, "depression" can mean an illness, a weather system, or an economic downturn.
Word sense disambiguation matters for many natural language processing applications, such as semantic analysis, machine translation, speech synthesis, and information retrieval.
This project focuses on disambiguating the senses of such words using modular models. The project is divided into two parts:
1) Convert the Wikipedia dataset into hashes and vectorize it using the hashing trick.
2) Classify using a modular model, then compare the results with existing work on the same problem in which the vectors were formed using word2vec.
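The vectorization step in part 1 can be sketched in plain Python. This is only an illustration of the hashing trick, not the project's implementation: it uses `zlib.crc32` as a stand-in hash (the project plans a MurmurHash3-style function), and the function name and `dim` parameter are hypothetical.

```python
import zlib

def hashing_trick_vectorize(tokens, dim=8):
    """Map a token list to a fixed-size vector via the hashing trick.

    Each token is hashed into one of `dim` buckets; a second hash bit
    supplies a sign, which reduces the bias introduced by collisions.
    """
    vec = [0.0] * dim
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))
        idx = h % dim                                 # bucket index
        sign = 1.0 if (h // dim) % 2 == 0 else -1.0  # signed update
        vec[idx] += sign
    return vec

# Any vocabulary maps into the same dim-sized vector, with no lookup table.
print(hashing_trick_vectorize("depression can mean an illness".split()))
```

The key property for hash2vec is that the vector size is fixed in advance, so no vocabulary dictionary has to be built or stored, unlike word2vec's embedding matrix.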
Schedule:
Week 1:
Sep. 3 | Discuss the topic and related background. |
Week 2:
Sep. 10 | Take an online course on the basics of neural networks (Stanford channel). Learn about hash2vec and word2vec. |
Week 3:
Sep. 17 | Continue the online course on neural networks and explore different hashing techniques to apply to the bag-of-words (BOW) model. |
Week 4:
Sep. 24 | Presentation on hash2vec. |
Week 5:
Oct. 1 | Deliverable 1: TensorFlow example of word2vec. |
Week 6:
Oct. 8 | Deliverable 2: word2vec implementation. |
Week 7:
Oct. 15 | Data analysis of the Wikipedia corpus. |
Week 8:
Oct. 22 | Deliverable 3: Experimentation with hash2vec. |
Week 9:
Oct. 29 | Deliverable 4: hash2vec implementation. |
Week 10:
Nov. 5 | Extract data from Wikipedia |
Week 11:
Nov. 12 | Data analysis for data preprocessing |
Week 12:
Nov. 19 | Literature Review |
Week 13:
Nov. 26 | Literature review on the differences between the word2vec and hash2vec techniques. |
Week 14:
Dec. 3 | Start working on the CS297 final project report. |
Week 15:
Dec. 10 | Final draft of report |
Week 16:
Dec. 18 | Final report |
Deliverables:
The full project will be done when CS298 is completed. The following will
be done by the end of CS297:
1. Word2vec implementation using the TensorFlow and gensim libraries
2. Calculating similar words using word2vec vectors and cosine similarity
3. Hash2vec implementation using the mmh3 (MurmurHash3) hash function
4. Calculating nearby words using Euclidean distance on hash2vec vectors
5. Experimentation with hash2vec: implementation of the hashing trick
i) An overview
ii) Summary of approaches
iii) Details of the platform and technologies I plan to use
iv) Plan for my experimentation
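The two similarity measures in deliverables 2 and 4 can be sketched in plain Python. This is a minimal illustration of the math; in practice, libraries such as gensim or NumPy would supply optimized versions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two vectors; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Cosine similarity depends only on direction, which suits word2vec embeddings whose magnitudes vary with word frequency; Euclidean distance also accounts for magnitude, which is the measure planned for the hash2vec count vectors.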
References:
[2016] "Hash2Vec: Feature Hashing for Word Embeddings". Luis Argerich, Matias J. Cano, and Joaquin Torre Zaffaroni. 2016.
[2017] "Learning to Understand Phrases by Embedding the Dictionary". Hill, F., Cho, K., Korhonen, A., and Bengio, Y. 2017.
[2013] "Distributed Representations of Words and Phrases and their Compositionality". Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. In Advances in Neural Information Processing Systems. 2013.