Deliverable #3: Data Preprocessing Program.

This deliverable is a Python program that extracts the pages of ambiguous words from a Wikipedia data dump.

The Wikipedia data dump was downloaded from https://dumps.wikimedia.org/enwiki/20170201/enwiki-20170201-pages-articles.xml.bz2. Since the dump is a huge bz2 file, the pages of ambiguous words are extracted while the dump is decompressed on the fly, rather than decompressing the whole file to disk first.

Before the dump is processed, a word list is first extracted from the disambiguation data at http://wiki.dbpedia.org/downloads-2016-04#h26493-2, and a main dictionary is built from this file. However, many pages whose titles are words in the main dictionary are redirects to other pages. Therefore, an additional dictionary is built during a first decompression pass over the Wikipedia bz2 file, using the main dictionary as a filter and recording the redirect targets. The additional dictionary is then used as the filter during a second decompression pass, which extracts the actual pages.
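As a rough illustration, the main dictionary could be built along the following lines. This is a minimal sketch that assumes the disambiguation data is a DBpedia N-Triples file (here called disambiguations_en.ttl) whose triple subjects are the ambiguous resources; the file name and triple layout are assumptions, not part of the deliverable.

import urllib.parse

PREFIX = "<http://dbpedia.org/resource/"

def build_main_dictionary(path):
    # Collect the set of ambiguous titles from the triple subjects.
    main_dict = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.startswith(PREFIX):
                continue  # skip comments and malformed lines
            # The subject URI ends at the first closing angle bracket.
            resource = line[len(PREFIX):line.index(">")]
            # Resource names percent-encode special characters and use
            # underscores in place of spaces.
            main_dict.add(urllib.parse.unquote(resource).replace("_", " "))
    return main_dict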

decompress-2.py
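The following is a minimal sketch of how such a two-pass streaming extraction could look; it is not the actual decompress-2.py, and the tag handling is deliberately simplified (a real MediaWiki dump nests the <title> and <redirect> elements inside each <page> block, and titles may contain XML entities that would also need unescaping).

import bz2
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>")
REDIRECT_RE = re.compile(r'<redirect title="(.*?)"')

def iter_pages(dump_path):
    # Yield the raw text of each <page>...</page> block, decompressing
    # the bz2 dump on the fly instead of extracting it to disk first.
    page_lines, in_page = [], False
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                in_page = True
            if in_page:
                page_lines.append(line)
            if "</page>" in line:
                yield "".join(page_lines)
                page_lines, in_page = [], False

def first_pass(dump_path, main_dict):
    # Build the additional dictionary: for every page whose title is in
    # the main dictionary, record the redirect target, or the title
    # itself when the page is not a redirect.
    additional = set()
    for page in iter_pages(dump_path):
        m = TITLE_RE.search(page)
        if m is None or m.group(1) not in main_dict:
            continue
        r = REDIRECT_RE.search(page)
        additional.add(r.group(1) if r else m.group(1))
    return additional

def second_pass(dump_path, additional, out_path):
    # Decompress the dump a second time and keep only the pages whose
    # titles are in the additional dictionary.
    with open(out_path, "w", encoding="utf-8") as out:
        for page in iter_pages(dump_path):
            m = TITLE_RE.search(page)
            if m and m.group(1) in additional:
                out.write(page)

Reading the dump with bz2.open in text mode keeps memory use flat regardless of the dump size, since only one page at a time is buffered.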