CS297 Proposal
Compare word2vec with hash2vec for Word Sense Disambiguation on Wikipedia corpus
Neha Gaikwad (neha.gaikwad@sjsu.edu)
Advisor: Dr. Chris Pollett
Description:
When a single word has multiple meanings, it is difficult for a machine to interpret a query containing that word. For example, "depression" can mean an illness, a weather system, or an economic downturn.
Word sense disambiguation matters for many natural language processing applications, such as semantic analysis, machine translation, speech synthesis, and information retrieval.
This project focuses on disambiguating the senses of such words using modular models. The project is divided into two parts:
1) Convert the Wikipedia dataset into hashes and vectorize it using the hashing trick.
2) Classify using a modular model, then compare the results with existing work on the same problem in which the vectors were formed using word2vec.
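The vectorization step in part 1 can be sketched in plain Python. This is only an illustration of the hashing trick, not the project's implementation: it uses `zlib.crc32` as a stand-in hash (the project plans a MurmurHash3-style function), and the function name and `dim` parameter are hypothetical.

```python
import zlib

def hashing_trick_vectorize(tokens, dim=8):
    """Map a token list to a fixed-size vector via the hashing trick.

    Each token is hashed into one of `dim` buckets; a second hash bit
    supplies a sign, which reduces the bias introduced by collisions.
    """
    vec = [0.0] * dim
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))
        idx = h % dim                                 # bucket index
        sign = 1.0 if (h // dim) % 2 == 0 else -1.0  # signed update
        vec[idx] += sign
    return vec

# Any vocabulary maps into the same dim-sized vector, with no lookup table.
print(hashing_trick_vectorize("depression can mean an illness".split()))
```

The key property for hash2vec is that the vector size is fixed in advance, so no vocabulary dictionary has to be built or stored, unlike word2vec's embedding matrix.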
Schedule:
Week 1:
Sep. 3 | Discuss the topic and related background. |
Week 2:
Sep. 10 | Take an online course on the basics of neural networks (Stanford channel). Learn about hash2vec and word2vec. |
Week 3:
Sep. 17 | Continue the online course on neural networks and explore different hashing techniques to apply to the bag-of-words (BOW) model. |
Week 4:
Sep. 24 | Presentation on hash2vec. |
Week 5:
Oct. 1 | Deliverable 1: TensorFlow example of word2vec. |
Week 6:
Oct. 8 | Deliverable 2: word2vec implementation. |
Week 7:
Oct. 15 | Data analysis of the Wikipedia corpus. |
Week 8:
Oct. 22 | Deliverable 3: Experimentation with hash2vec. |
Week 9:
Oct. 29 | Deliverable 4: hash2vec implementation. |
Week 10:
Nov. 5 | Extract data from Wikipedia |
Week 11:
Nov. 12 | Data analysis for data preprocessing |
Week 12:
Nov. 19 | Literature Review |
Week 13:
Nov. 26 | Literature review on the differences between the word2vec and hash2vec techniques. |
Week 14:
Dec. 3 | Start working on the CS297 final project report. |
Week 15:
Dec. 10 | Final draft of report |
Week 16:
Dec. 18 | Final report |
Deliverables:
The full project will be done when CS298 is completed. The following will
be done by the end of CS297:
1. Word2vec implementation using the TensorFlow and gensim libraries
2. Calculating similar words using word2vec vectors and cosine similarity
3. Hash2vec implementation using the mmh3 (MurmurHash3) hash function
4. Calculating nearby words using Euclidean distance on hash2vec vectors
5. Experimentation with hash2vec: implementation of the hashing trick
i) An overview
ii) Summary of approaches
iii) Details of the platform and technologies I plan to use
iv) Plan for my experimentation
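The two similarity measures in deliverables 2 and 4 can be sketched in plain Python. This is a minimal illustration of the math; in practice, libraries such as gensim or NumPy would supply optimized versions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two vectors; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Cosine similarity depends only on direction, which suits word2vec embeddings whose magnitudes vary with word frequency; Euclidean distance also accounts for magnitude, which is the measure planned for the hash2vec count vectors.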
References:
[2016] "Hash2Vec: Feature Hashing for Word Embeddings". Luis Argerich, Matias J. Cano, and Joaquin Torre Zaffaroni. 2016.
[2017] "Learning to Understand Phrases by Embedding the Dictionary". Hill, F., Cho, K., Korhonen, A., and Bengio, Y. 2017.
[2013] "Distributed Representations of Words and Phrases and their Compositionality". Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. In Advances in Neural Information Processing Systems. 2013.