Decide the Neural Network Architecture

This deliverable involves deciding on the architecture of the neural network model that I am going to use for next semester's implementation.

The main idea of this architecture is inspired by the solution to the image captioning problem proposed by Google in Show and Tell: A Neural Image Caption Generator. The paper draws on the language translation problem. Earlier approaches to language translation divided the problem into multiple components, each working on a subproblem such as translating words individually, aligning words, or reordering them. More recently, it has been shown that the problem can be solved in a much simpler way.

Machine translation can be solved using recurrent neural networks (RNNs) while still achieving state-of-the-art performance. In this approach, an "encoder" RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a "decoder" RNN that generates the target sentence. The same approach can be applied to the image captioning problem, and therefore to our problem, with a slight modification: the encoder RNN is replaced by a deep convolutional neural network (CNN).
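
The encoder-decoder idea described above can be illustrated with a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the weights are random and untrained, a vanilla tanh RNN stands in for whatever recurrent cell is finally chosen, and the names (encode, decode, params, the token ids) are hypothetical. It only shows how a variable-length source sequence is squeezed into one fixed-length vector that then seeds the decoder.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    """Matrix-vector product on plain lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def rnn_step(w_in, w_hid, x, h):
    """One vanilla RNN step: h' = tanh(W_in x + W_hid h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_hid, h))]

def encode(source_ids, params, vocab_size, hidden_size):
    """Read the whole source sequence and return the final hidden state."""
    h = [0.0] * hidden_size
    for tok in source_ids:
        h = rnn_step(params["enc_in"], params["enc_hid"], one_hot(tok, vocab_size), h)
    return h

def decode(h, params, vocab_size, start_id, end_id, max_len):
    """Greedy decoding; the encoder vector is the decoder's initial hidden state."""
    out, tok = [], start_id
    for _ in range(max_len):
        h = rnn_step(params["dec_in"], params["dec_hid"], one_hot(tok, vocab_size), h)
        scores = matvec(params["dec_out"], h)
        tok = max(range(vocab_size), key=scores.__getitem__)  # most likely token
        if tok == end_id:
            break
        out.append(tok)
    return out

rng = random.Random(0)
VOCAB, HIDDEN = 6, 8  # hypothetical toy sizes
params = {
    "enc_in": make_matrix(HIDDEN, VOCAB, rng),
    "enc_hid": make_matrix(HIDDEN, HIDDEN, rng),
    "dec_in": make_matrix(HIDDEN, VOCAB, rng),
    "dec_hid": make_matrix(HIDDEN, HIDDEN, rng),
    "dec_out": make_matrix(VOCAB, HIDDEN, rng),
}

context = encode([2, 3, 4], params, VOCAB, HIDDEN)  # fixed-length vector
output_ids = decode(context, params, VOCAB, start_id=0, end_id=1, max_len=5)
```

Because the weights are untrained, the decoded ids are meaningless; the point is only the data flow from source sequence to fixed-length vector to target sequence.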

Over the last few years, it has been shown that convolutional neural networks can produce a rich representation of an image by embedding it into a fixed-length vector, such that this representation can be used for a wide variety of vision tasks. Hence, a CNN is a natural choice for the image "encoder". In our approach, we will first pre-train the CNN on an image classification task and then use its last hidden layer as the input to the recurrent neural network decoder that generates the corresponding LaTeX.
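
As a rough sketch of this image-to-LaTeX pipeline, the plain-Python example below swaps the RNN encoder for a toy convolutional one. All of it is assumed for illustration: the kernels are random stand-ins for a pre-trained CNN, the single convolution with global average pooling stands in for a deep network's last hidden layer, LATEX_VOCAB is a hypothetical toy vocabulary, and the image features are used to initialise the decoder's hidden state (feeding them as the decoder's first input would be an equally plausible reading of the text).

```python
import math
import random

LATEX_VOCAB = ["<start>", "<end>", "x", "^", "2", "+", "y"]  # hypothetical

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def conv_feature(image, kernel):
    """Valid 2-D cross-correlation, ReLU, then global average pooling."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            act = sum(kernel[i][j] * image[r + i][c + j]
                      for i in range(k) for j in range(k))
            total += max(0.0, act)  # ReLU
    return total / (rows * cols)

def cnn_encode(image, kernels):
    """Fixed-length image embedding: one pooled activation per kernel."""
    return [conv_feature(image, kern) for kern in kernels]

def decode_latex(features, params, max_len=10):
    """Greedy RNN decoder whose initial hidden state comes from the image."""
    h = [math.tanh(v) for v in matvec(params["proj"], features)]
    tok = LATEX_VOCAB.index("<start>")
    tokens = []
    for _ in range(max_len):
        x = [1.0 if i == tok else 0.0 for i in range(len(LATEX_VOCAB))]
        pre = [a + b for a, b in zip(matvec(params["w_in"], x),
                                     matvec(params["w_hid"], h))]
        h = [math.tanh(p) for p in pre]
        scores = matvec(params["w_out"], h)
        tok = max(range(len(LATEX_VOCAB)), key=scores.__getitem__)
        if LATEX_VOCAB[tok] == "<end>":
            break
        tokens.append(LATEX_VOCAB[tok])
    return tokens

rng = random.Random(0)
NUM_KERNELS, HIDDEN = 4, 8  # hypothetical toy sizes
V = len(LATEX_VOCAB)
kernels = [rand_matrix(3, 3, rng) for _ in range(NUM_KERNELS)]
params = {
    "proj": rand_matrix(HIDDEN, NUM_KERNELS, rng),
    "w_in": rand_matrix(HIDDEN, V, rng),
    "w_hid": rand_matrix(HIDDEN, HIDDEN, rng),
    "w_out": rand_matrix(V, HIDDEN, rng),
}

# Two grayscale "images" of different sizes (rows x columns of pixel values).
small = [[rng.random() for _ in range(8)] for _ in range(6)]
large = [[rng.random() for _ in range(20)] for _ in range(12)]
features_small = cnn_encode(small, kernels)
features_large = cnn_encode(large, kernels)
latex_tokens = decode_latex(features_small, params)
```

Both images map to a vector of the same length, which is what lets a single decoder consume any input image. With untrained weights the emitted tokens are arbitrary; in the planned design the CNN would be pre-trained and the decoder trained on image and LaTeX pairs.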



Related Resource: