Deliverable 3: Implement the word2vec program to find word embeddings for Gujarati words.

Description

The aim of this deliverable was to understand language modeling through deep learning techniques. A language model is a probability distribution over all strings in a language; for example, a machine translation program must differentiate between candidate sentences while translating from one language to another. Language models assign a probability to the next word given the preceding words. Because deep learning models cannot operate on words directly, each word is associated with a floating-point vector. These vectors are called word embeddings. If two different words tend to be followed by the same words, they receive similar word embeddings, and this similarity is measured with cosine similarity.

The deliverable computed word embeddings for a set of Gujarati words whose English counterparts already had embeddings. Wikipedia articles containing these words were selected as the corpus. The text was split on spaces and punctuation to create a word index. Since Gujarati text cannot be processed as raw bytes, each word was first decoded to Unicode using Python's built-in facilities. TensorFlow was then used to learn an embedding from each (target word, context word) pair.
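The tokenization and word-index step can be sketched as follows. This is a minimal illustration, not the deliverable's actual tokenizer; the helper name, corpus file name, and example sentence are hypothetical. In Python 3, strings are already Unicode, so opening the corpus file with an explicit encoding performs the decoding mentioned above.

    import re
    from collections import Counter

    def build_word_index(text):
        # \w matches Unicode word characters in Python 3, so Gujarati
        # letters (U+0A80-U+0AFF) are kept while spaces and punctuation
        # act as separators.
        words = re.findall(r"\w+", text)
        # Index words by descending frequency; 0 is reserved for
        # out-of-vocabulary words.
        counts = Counter(words)
        index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
        return words, index

    # Reading with an explicit encoding decodes the raw bytes to Unicode:
    # text = open("corpus.txt", encoding="utf-8").read()
    words, index = build_word_index("ગુજરાત ભારત દેશ નું એક રાજ્ય છે")
    print(index)

With the corpus reduced to word indices, the embeddings themselves can be trained. Below is a minimal TensorFlow/Keras sketch of a skip-gram setup: look up the target word's embedding and predict its context word with a softmax. The vocabulary size, embedding width, window size, and random placeholder data are all assumptions for illustration; the deliverable's actual network may differ.

    import numpy as np
    import tensorflow as tf

    VOCAB_SIZE = 5000   # assumed vocabulary size
    EMBED_DIM = 64      # assumed embedding width
    WINDOW = 2          # context words taken on each side of the target

    def skipgram_pairs(ids, window=WINDOW):
        # Build (target, context) pairs from a sequence of word indices.
        pairs = []
        for i, target in enumerate(ids):
            lo, hi = max(0, i - window), min(len(ids), i + window + 1)
            pairs.extend((target, ids[j]) for j in range(lo, hi) if j != i)
        return np.array(pairs)

    # Skip-gram model: embed the target word, then predict its context
    # word with a softmax over the whole vocabulary.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # Placeholder corpus; in the deliverable these ids come from the
    # word index built during tokenization.
    corpus_ids = np.random.randint(1, VOCAB_SIZE, size=2000)
    pairs = skipgram_pairs(corpus_ids.tolist())
    model.fit(pairs[:, :1], pairs[:, 1], batch_size=128, epochs=2)

    # Each row of this matrix is the learned embedding of one word.
    embeddings = model.layers[0].get_weights()[0]

Finally, similarity between two embeddings u and v is measured as cos(u, v) = u.v / (||u|| ||v||), which is near 1 when two words appear in similar contexts and near 0 when they are unrelated. A small self-contained sketch, using toy vectors in place of rows of the learned embedding matrix:

    import numpy as np

    def cosine_similarity(u, v):
        # cos(u, v) = u.v / (||u|| * ||v||)
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Toy vectors; in the deliverable u and v would be two rows of the
    # learned embedding matrix.
    print(cosine_similarity(np.array([1.0, 2.0, 0.5]),
                            np.array([0.9, 2.1, 0.4])))  # close to 1.0
    print(cosine_similarity(np.array([1.0, 0.0]),
                            np.array([0.0, 1.0])))       # exactly 0.0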

Source Code Download

RUNNING THE CODE

1. python word_embedding.py

REFERENCE:

[Charniak 2019] Eugene Charniak. "Introduction to Deep Learning." The MIT Press, 2019.