Portuguese Stemmer for Yioop

Description:

Stemming is the process of removing the more common morphological and inflectional endings from words. It is mainly used as part of term normalisation when setting up Information Retrieval systems.

For grammatical reasons, documents use different forms of a word, such as organize, organizes, and organizing. There are also families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations it is useful for a search on one of these words to return documents that contain another word from the same set.
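
To make the idea concrete, here is a minimal sketch of suffix stripping applied to the English examples above. It is a toy illustration only and is not the Snowball algorithm: the suffix list, the minimum stem length of 3, and the name toy_stem are all assumptions chosen so that each word family collapses to one stem.

```python
# Toy suffix stripper: NOT the Snowball algorithm, only an illustration
# of how different word forms can be reduced to a shared stem.
# The suffix list and minimum stem length are illustrative assumptions.
SUFFIXES = ("atization", "ization", "izing", "izes", "ize", "atic", "acy")
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Strip the longest matching suffix, keeping at least MIN_STEM_LENGTH letters."""
    word = word.lower()
    # Try longer suffixes first so "izes" wins over "ize", and so on.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word


if __name__ == "__main__":
    for family in (("organize", "organizes", "organizing"),
                   ("democracy", "democratic", "democratization")):
        print([toy_stem(w) for w in family])
```

With every form in a family mapped to the same stem, a search index keyed on stems can match a query for one form against documents that contain another. A real stemmer replaces the fixed suffix list with ordered, language-specific rules.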

Yioop currently supports many locales. Stemming rules differ from one locale to another, so each language needs its own stemmer. The goal of this task is to develop a stemmer for the Portuguese language.

Results:

The stemmer was developed following the Snowball stemming algorithm for Portuguese.
Tested on 32016 Portuguese words.
Patch accepted: Mantis Bug ID: 0000165

Deliverables:

1. /locale/pt/resources/tokenizer.php
2. /tests/pt_tokenizer_test.php
3. /portuguese_stemmer/input_vocabulary.txt
4. /portuguese_stemmer/stemmed_result.txt
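
The vocabulary and result files suggest a simple regression check: stem every input word and compare it with the expected stem. The sketch below assumes the two files are line-aligned with one word per line, which the source does not state. The name find_mismatches is hypothetical, and the stem argument stands in for whatever stemmer is under test; the project's actual tests are in the PHP test file listed above.

```python
from typing import Callable, Iterable, List, Tuple


def find_mismatches(stem: Callable[[str], str],
                    words: Iterable[str],
                    expected: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (word, expected_stem, actual_stem) for every disagreement.

    Assumes `words` and `expected` are line-aligned, one entry per line
    (an assumption about the file layout, not something the source confirms).
    """
    mismatches = []
    for word, want in zip(words, expected):
        word, want = word.strip(), want.strip()
        if not word:
            continue  # skip blank lines
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


# Hypothetical usage with the deliverable files; `my_stem` is a placeholder
# for the stemmer being checked:
#
# with open("portuguese_stemmer/input_vocabulary.txt", encoding="utf-8") as v, \
#      open("portuguese_stemmer/stemmed_result.txt", encoding="utf-8") as e:
#     print(len(find_mismatches(my_stem, v, e)), "mismatches")
```

Returning the mismatches rather than only a pass/fail flag makes it easier to see which rule of the stemmer disagrees with the expected output.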

References:

Yioop Documentation
[2015] Portuguese stemming algorithm. Portuguese Stemmer 2015