Create a Dutch Stemmer for the Yioop Search Engine

Aim

The goal of this deliverable was to create a Dutch stemmer that can be used in the Yioop search engine.

Overview

Stemmers are used in Information Retrieval (IR) systems and in natural language processing. What are stemmers, and why are they important? A stemmer is an algorithm that reduces all words with the same stem to a common form [Lovins1968]. Reducing words to a common form is known as conflation (bringing together). For example, connect, connected, connecting, connection and connections [Porter1980] would all conflate to connect. Because a stemmer conflates the terms in a document, the database storing the information can hold one entry for connect instead of one for each variation. This improves search engine query performance and recall, and reduces the number of terms to store and index. Both characteristics are highly advantageous for a search engine and for a natural language processor.

A stemmer designed specifically for IR was developed in 1968 by Julie Beth Lovins. Lovins's ideas were popular enough to spur many other types of stemmers. Stemmers fall into three categories: truncating, statistical and mixed. A truncating stemmer does exactly what its name says: it removes the appropriate affixes (prefixes or suffixes) from a word. Statistical stemmers remove affixes based on statistical analysis such as letter frequency. Lastly, mixed stemmers use inflectional and derivational methods; in other words, they take into account the syntactic variations of the language and the part of speech (POS) a word comes from. More information on this topic can be found in the documents listed in the References section.
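
To make conflation concrete, here is a minimal Python sketch of a truncating stemmer applied to the "connect" example above. It is a toy for illustration only, not the Porter algorithm: the suffix list, the minimum stem length and the names toy_stem and build_index are all assumptions made for this example.

```python
# Toy suffix-stripping stemmer: illustrates conflation only.
# This is NOT the Porter algorithm; the suffix list is a hand-picked assumption.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")  # longer suffixes are tried first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(terms):
    """Map each stem to the set of surface forms that conflate to it."""
    index = {}
    for term in terms:
        index.setdefault(toy_stem(term), set()).add(term)
    return index


words = ["connect", "connected", "connecting", "connection", "connections"]
index = build_index(words)
print(sorted(index))            # one index key for all five forms
print(sorted(index["connect"]))
```

All five forms end up under a single key, which is the storage and recall benefit described above. A real stemmer needs many more rules and conditions to avoid over-stemming unrelated words.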

Work Performed

The Dutch stemmer I worked on is a truncating stemmer based on the work of Martin Porter. In 1980, Porter presented a simple algorithm for stemming English language words [Willett2006]. Today Porter has a website with many stemmers developed by him. Each supported language on the site comes with pseudocode for implementing the stemmer yourself and a definition of the stemmer written in Snowball. Porter gives his motivation as follows: "There are two main reasons for creating Snowball. One is the lack of readily available stemming algorithms for languages other than English. The other is the consciousness of a certain failure on my part in promoting exact implementations of the stemming algorithm described in (Porter 1980), which has come to be called the Porter stemming algorithm" [Porter2001]. With the help of the Dutch stemmer page on Porter's website, I was able to start working on my own Dutch stemmer. The implementation language was chosen for me, since the Yioop search engine is written in PHP. Now that I had my pseudocode and programming language, I searched the internet to see what others had already done. I found the Simplicity Lab PHP Dutch Stemmer, which claimed to be complete but was not. I downloaded the code and took a look at it. It was not written in the format I would have liked, so the first thing I did was refactor it. Next I ported it to the format needed for the code to run within the Yioop search engine. To see how good it was, I ran it against the vocabulary words and stems provided on Porter's website. To my amazement, it produced tens of thousands of errors out of the 49,000-plus words in the list. At this point I knew I was going to spend hours repairing the code I had planned to use as my base. After much trial and error, the first smoking gun turned out to be the strtolower() and split() functions. Dutch uses characters with umlauts and accents, and strtolower() and split() cannot handle those types of characters.
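
The kind of failure described above can be reproduced in a few lines. The following is a Python analogy, not the PHP code from the stemmer: it assumes the text is UTF-8 and contrasts byte-oriented lowercasing with Unicode-aware lowercasing. The sample words are chosen for illustration only.

```python
# Python analogy for byte-oriented vs. Unicode-aware string handling.
# Illustration only; this is not the Yioop or Simplicity Lab code.
import re

text = "CO\u00d6RDINATIE CAF\u00c9"  # "COÖRDINATIE CAFÉ"

# Byte-oriented: bytes.lower() only converts ASCII letters, so the
# two-byte UTF-8 sequences for the accented letters are left untouched.
byte_lowered = text.encode("utf-8").lower().decode("utf-8")

# Unicode-aware: str.lower() converts every cased letter.
unicode_lowered = text.lower()

# One accented character is two bytes in UTF-8, so code that walks a
# string byte by byte can cut a character in half.
byte_units = len("\u00eb".encode("utf-8"))  # "ë" as UTF-8 bytes
char_units = len("\u00eb")                  # "ë" as characters

# Unicode-aware splitting on whitespace keeps accented letters intact.
tokens = re.split(r"\s+", unicode_lowered)

print(byte_lowered)
print(unicode_lowered)
print(byte_units, char_units)
print(tokens)
```

The byte-oriented result still contains the uppercase accented letters, so a stemmer that compares lowercased suffixes afterwards would miss them.
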
I changed them to mb_strtolower() and preg_split() respectively. Next I reworked the method that removes umlauts and accented characters: some vowels needed to be removed from it and some needed to be added. Once that method was complete, I had to add words to the no-stem list and review each of the 5 steps (playing with regular expressions) until the stemmer passed all tests.

In addition to the stemmer, the Yioop search engine also has a configure.ini file used to translate the static phrases presented to the user. There are over 1100 of them. This was no easy task either. I spent hours trying to figure out the best approach and finally came up with a foolproof way to do it. If you have ever opened an .ini file, you know the settings are stored in a <setting> = <value> format. First I opened the configure.ini file in Notepad, replaced " = " with a comma and saved the result as a .csv file. That allowed me to open the file in Excel, which showed the settings and values separated into columns and let me select only the values. I copied about 400 values at a time, pasted them into Google Translate and clicked Translate. Once the words were translated, I copied the translations into the column next to the original values. I repeated this process until all of the words were translated. I double-checked a random sample of the words and was confident the translations were correct. Next, I deleted the English values column and saved the file. Then I opened the file in Notepad again, replaced the comma with " = ", and voilà, I had my configure.ini file. I was pretty much home free, but one more task remained: each of the values had to be quoted. I wrote a C# application that split the settings and values on " = ", removed any existing quotes from the values, wrapped each value in quotes and saved the file.
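
The quoting step can be sketched as follows. The original utility was written in C# and its source is not shown here, so this Python version is only an approximation of the behavior described above. The function name quote_ini_values, the sample entries and the decision to pass through lines that lack the " = " separator are assumptions.

```python
def quote_ini_values(lines, separator=" = "):
    """Return the lines with every value wrapped in one pair of double quotes.

    Lines without the separator (blank lines, comments, section headers)
    are passed through unchanged; that handling is an assumption.
    """
    result = []
    for line in lines:
        if separator not in line:
            result.append(line)
            continue
        # Split on the first separator only, so a value may itself contain " = ".
        setting, value = line.split(separator, 1)
        value = value.strip().strip('"')  # drop any existing surrounding quotes
        result.append(f'{setting}{separator}"{value}"')
    return result


# Hypothetical entries for illustration, not taken from Yioop's configure.ini.
sample = [
    "; example comment line",
    "search_button = Zoeken",
    'next_page = "Volgende"',
]
for line in quote_ini_values(sample):
    print(line)
```

Stripping existing quotes before adding new ones keeps the step idempotent, so running it twice does not produce doubled quotes. Quotes embedded inside a value are not escaped in this sketch.
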
After all of that was done, I submitted an issue and a patch in MantisBT for review by the repository manager (Dr. Pollett). Hopefully it will make it into the Yioop search engine at some point.

Results

In conclusion, I developed the files below to be used in a patch for the Yioop search engine to support the Dutch language. This was a great learning experience. I had never worked with stemmers before and, frankly, had never heard of them. I am glad I had a chance to work on this topic.

References

[Lovins1968] Development of a Stemming Algorithm. Julie Beth Lovins. Mechanical Translation and Computational Linguistics. 1968.
[Porter1980] An algorithm for suffix stripping. M.F. Porter. Program: Electronic Library and Information Systems. 1980.
[Willett2006] The Porter stemming algorithm: then and now. Peter Willett. Program: Electronic Library and Information Systems. 2006.
[Porter2001] Snowball: A language for stemming algorithms. M.F. Porter. Snowball. 2001.