Deliverable 4: Dataset generation ideas
The purpose of this deliverable is to research ideas for generating the dataset for the project.
We need images in Hindi and the corresponding translated images in English to build the model and evaluate it. There are many possible sources of such images: the United Nations (UN) website, the website of India's Parliament (Rajya Sabha), Hindi novels that have been translated into English, Wikipedia pages that have both English and Hindi versions, etc.
The main challenge in securing the dataset is identifying the corresponding translated English image programmatically. This is even more challenging for websites like Wikipedia, where not all pages have been translated: while crawling, a page in Hindi might not have an English version and vice versa. We have written scripts that capture images from the internet by crawling the web with headless browsers. Other options, such as using Wikipedia dumps, also need to be explored.
Around 30,000-40,000 images are needed to train the model and evaluate it using various metrics.
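For the Wikipedia pairing problem described above, one possible approach is to ask the MediaWiki API which Hindi pages have an English interlanguage link before crawling them. The sketch below is an illustration only and is not part of the crawling scripts mentioned in this deliverable. The function names `build_langlinks_url` and `extract_pairs` and the sample response are hypothetical. The code assumes the `formatversion=2` JSON layout, where `pages` is a list and each language link has a `title` key. It does not perform the network request itself.

```python
from urllib.parse import urlencode

# Hindi Wikipedia API endpoint; the pages are queried from the Hindi side.
API_URL = "https://hi.wikipedia.org/w/api.php"


def build_langlinks_url(titles, target_lang="en"):
    """Build a query URL asking for the target-language link of each title."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": "|".join(titles),   # several titles can share one request
        "lllang": target_lang,        # only return links to this language
        "format": "json",
        "formatversion": "2",         # assumed layout: pages returned as a list
    }
    return API_URL + "?" + urlencode(params)


def extract_pairs(response):
    """Return (hindi_title, english_title) for pages that have a translation.

    Pages without a 'langlinks' entry have no English version and are skipped.
    """
    pairs = []
    for page in response.get("query", {}).get("pages", []):
        links = page.get("langlinks", [])
        if links:
            pairs.append((page["title"], links[0]["title"]))
    return pairs


# Hypothetical response, hand-written to show the assumed shape.
sample_response = {
    "query": {
        "pages": [
            {"pageid": 1, "title": "भारत",
             "langlinks": [{"lang": "en", "title": "India"}]},
            {"pageid": 2, "title": "कोई पृष्ठ"},  # no English version
        ]
    }
}

url = build_langlinks_url(["भारत", "कोई पृष्ठ"])
pairs = extract_pairs(sample_response)
```

Only the titles returned in `pairs` would then be handed to the headless-browser crawler, so pages without a counterpart in the other language are filtered out before any image is captured.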