Deliverable-1
The following are submitted:
- A PHP program, trie_eng.php, that builds a trie of English dictionary words and outputs it in gzip format
- The file of English dictionary words used by the above program
Details
The aim was to construct a data structure of English dictionary words that
can be used to autocomplete words as a user types in Yioop. A suitable
structure for this is a Trie [1].
The PHP program:
- Creates a Trie in which words are stored using multi-level PHP arrays
- JSON encodes the Trie and outputs a gzip-compressed version of it
- Eliminates stop words, words with fewer than 3 letters, and words that contain non-ASCII characters
- Produces a final gzip file of around 250 KB, which is a reasonable size to send over the network and load while using Yioop [2]
The gzip file will be loaded whenever a user accesses the Yioop website and
is then processed further using JavaScript on the client machine.
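To make the structure concrete, here is a minimal sketch of how such a Trie could be built and then queried for suggestions on the client. It is written in JavaScript for illustration only; the submitted builder is a PHP program and may differ in its details. The names STOP_WORDS, END, keepWord, buildTrie and suggest, the sample stop-word list, the "$" end-of-word marker and the restriction to lowercase ASCII letters are all assumptions made for this sketch.

```javascript
// Illustrative sketch only: the submitted builder is a PHP program.
// STOP_WORDS, END, keepWord, buildTrie and suggest are hypothetical names.
const STOP_WORDS = new Set(["the", "and", "for"]); // assumed sample list
const END = "$"; // assumed end-of-word marker key

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Keep lowercase ASCII-letter words of 3+ letters that are not stop words.
function keepWord(word) {
  return /^[a-z]{3,}$/.test(word) && !STOP_WORDS.has(word);
}

// Build the trie as nested objects, one level per letter.
function buildTrie(words) {
  const root = {};
  for (const raw of words) {
    const word = raw.trim().toLowerCase();
    if (!keepWord(word)) continue;
    let node = root;
    for (const ch of word) {
      if (!has(node, ch)) node[ch] = {};
      node = node[ch];
    }
    node[END] = 1; // mark that a complete word ends here
  }
  return root;
}

// Collect up to `limit` stored words that start with `prefix`,
// in alphabetical order.
function suggest(trie, prefix, limit = 10) {
  let node = trie;
  for (const ch of prefix) {
    if (!has(node, ch)) return [];
    node = node[ch];
  }
  const results = [];
  const walk = (current, word) => {
    for (const key of Object.keys(current).sort()) {
      if (results.length >= limit) return;
      if (key === END) results.push(word);
      else walk(current[key], word + key);
    }
  };
  walk(node, prefix);
  return results;
}

// "an" is too short, "the" is a stop word, "café" is non-ASCII.
const trie = buildTrie(["an", "the", "cat", "car", "cart", "café"]);
const json = JSON.stringify(trie); // the text that would be gzipped and served
console.log(suggest(JSON.parse(json), "ca")); // car, cart, cat
```

Nested objects survive JSON encoding unchanged, so the client can parse the payload and walk it directly without rebuilding any index.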
To run the program, use the following on the command line:
php trie-eng.php dic_file_name
Code and dictionary file
Timing tests
Experiments were conducted to measure the page load time when the Trie is
loaded with the website. Load times were monitored using the Firefox Web Console.
Loading the uncompressed JSON Trie, which is 2.5 MB, takes around 2.5 seconds.
The gzip option of HTTP was then enabled in the Apache web server by adding
the following statements to httpd.conf:
<IfModule deflate_module>
SetOutputFilter DEFLATE
</IfModule>
With this setting, Apache gzips the 2.5 MB JSON encoded Trie and it loads in
around 400 ms, which is far less than loading the uncompressed Trie directly.
The already gzipped file, which is about 250 KB, would load in about 35 ms.
The third option is to store the Trie compressed with a .gz extension and
rely on the browser sending Accept-Encoding: gzip,deflate in its HTTP requests.
The server labels the .gz file as gzip encoded, so the browser expects a
compressed file and uncompresses it on the fly. For this I activated the
following options in httpd.conf:
# AddEncoding allows you to have certain browsers uncompress
# information on the fly. Note: Not all browsers support this.
AddEncoding x-compress .Z
AddEncoding x-gzip .gz .tgz
With this option the Trie takes just 3 ms to load. The browser decompresses it
automatically and makes the JSON Trie available for autosuggest.
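The round trip that the browser performs here can be illustrated outside the browser. The following is a minimal sketch that uses Node's built-in zlib module as a stand-in for the browser's automatic decoding; the tiny trie object and the variable names are hypothetical, and no timing or size claims are implied by it.

```javascript
// Illustrative only: in the browser this decoding happens automatically
// when the response is labelled as gzip encoded.
const zlib = require("zlib");

// Hypothetical stand-in for the JSON encoded trie.
const json = JSON.stringify({ c: { a: { t: { $: 1 }, r: { $: 1 } } } });

// Server side: the .gz file holds the gzip-compressed JSON text.
const gzFile = zlib.gzipSync(Buffer.from(json, "utf8"));

// Client side: undo the gzip encoding, then parse the JSON.
const trie = JSON.parse(zlib.gunzipSync(gzFile).toString("utf8"));

// A gzip stream starts with the magic bytes 0x1f 0x8b.
console.log(gzFile[0] === 0x1f && gzFile[1] === 0x8b);
// The decoded trie matches what was encoded.
console.log(JSON.stringify(trie) === json);
```

Because the decoding step is lossless, the client-side autosuggest code sees exactly the JSON that the builder produced.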
After conducting these experiments, it was concluded that:
- A compressed Trie with a .gz extension will be made available on the Yioop server
- httpd.conf will be modified to serve gzip-compressed files
- The browser will unzip the data, which will then be used for autosuggest
References
[1] http://en.wikipedia.org/wiki/Trie
[2] http://developer.yahoo.com/performance/rules.html