Stopping, Character n-grams, Non-English Languages




CS267

Chris Pollett

Oct. 1, 2012

Outline

Finishing up Stemming

Example of Stemmed Text

Stopping

Quiz

Which of the following is true?

  1. For our Boolean document retrieval algorithms, the notions of a candidate solution and an actual solution are the same.
  2. Query response time is the average number of queries processed in a given amount of time.
  3. Lemmatization is roughly the process whereby a term is reduced to a word in the sense of a dictionary entry.

Characters

Understanding Unicode

Character n-grams

European Languages

CJK(V) Languages