CS297 Proposal

Japanese Kanji Suggestion Tool

Sujata Dongre (sujata.dongre@gmail.com)

Advisor: Dr. Chris Pollett

Description:

When a misspelled search term is entered into a search engine such as Google, the engine often offers help in the form of "Did you mean:...". Similarly, my project aims to provide suggestions when a user enters incorrect Japanese text.

The Japanese language uses three scripts: Hiragana, Katakana and the most difficult, Kanji. In the past, Japanese was written only vertically; horizontal writing is more common nowadays. A single Kanji may be used to write one or more different compound words. From the reader's point of view, a Kanji is said to have one or more different "readings". Hence it is sometimes very difficult to identify the Kanji in a text and to know how to read them, even for someone who knows Japanese. Various tools are available nowadays that provide translation help. Well-known translation websites include the following:

Yahoo: http://babelfish.yahoo.com/translate_txt
Google: http://translate.google.com/#
ALC: http://www.alc.co.jp/

But what if the Japanese term entered for the search is itself wrong? For example, suppose you are reading Japanese text on a website and come across a sentence like "まずは、正しい英語学習法に頭をCHANGEしてください。" You do not understand the sentence because you cannot read the Kanji, so you decide to use one of the above translation tools. However, you do not know which Kanji to copy, and you mistakenly select "習法". The results returned by the three translation websites above are as follows:

Yahoo: Learning Method
Google: 習法
ALC: Result not found.

Hence, the purpose of my project is to ask the user, "Did you mean: "学習法"?", where "学習法" is the correct Japanese term and has an equivalent English meaning.
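As a rough illustration of this idea, and not necessarily the method the project will end up using, the suggestion step could be sketched as a substring lookup against a dictionary. The small dictionary below and the function names `suggest` and `did_you_mean` are hypothetical; a real tool would presumably load its entries from a resource such as EDICT.

```python
# Hypothetical mini-dictionary: Japanese term -> English gloss.
# These few entries are illustrative only, not taken from a real dictionary file.
DICTIONARY = {
    "学習": "learning; study",
    "学習法": "learning method",
    "英語": "English (language)",
    "法": "law; method",
}


def suggest(query, dictionary=DICTIONARY):
    """Return dictionary terms that contain `query` as a substring.

    Returns an empty list when the query is empty or is already a
    dictionary term, since no correction is needed in that case.
    Candidates are ordered shortest first, then alphabetically.
    """
    if not query or query in dictionary:
        return []
    candidates = [term for term in dictionary if query in term]
    return sorted(candidates, key=lambda term: (len(term), term))


def did_you_mean(query, dictionary=DICTIONARY):
    """Format the best suggestion as a prompt, or return None if there is none."""
    suggestions = suggest(query, dictionary)
    if not suggestions:
        return None
    best = suggestions[0]
    return f'Did you mean: "{best}" ({dictionary[best]})?'


if __name__ == "__main__":
    # The mistaken selection from the example sentence above.
    print(did_you_mean("習法"))
```

Plain substring matching only covers the case where the user copies part of a longer term; other kinds of mistakes would need a different matching strategy.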

Schedule:

Week 1: Aug 26-Aug 28
1. Prepare CS297 Proposal
2. Refresh Japanese language

Week 2: Aug 31-Sep 9
1. Study material on language processing
2. Study 'Statistical Language Learning' book
3. Study research papers from ACM Digital Library such as 'Using the web as a bilingual dictionary' and 'Automatic transliteration for Japanese-to-English Text Retrieval'
4. Search for and study different Japanese text corpora such as 'Kyoto Text Corpus' and 'Tanaka Corpus'
5. Work on Deliverable 1

Week 3: Sep 10-Sep 16
1. Deliverable 1 due: Report on experiments with installing and working with a Japanese text corpus

Week 4: Sep 17-Sep 28
1. Search for standard Japanese grammar syntax for computers to run on the corpus
2. Work on Deliverable 2

Week 5: Sep 29-Oct 7
1. Deliverable 2 due: Report on standard Japanese grammar techniques for computers

Week 6: Oct 8-Oct 26
1. Search for and study existing parsers for Japanese text
2. Study an algorithm that can run on a Japanese corpus
3. Work on Deliverable 3

Week 7: Oct 27-Nov 4
1. Deliverable 3 due: Report on experiments with parsers and algorithm

Week 8: Nov 5-Nov 18
1. Search for and study MySQL methods for text-based search in Japanese
2. Study online web dictionaries that can be used as a database, e.g. WWWJDIC, EDICT and Japanese WordNet
3. Work on Deliverable 4

Week 9: Nov 19-Nov 25
1. Deliverable 4 due: Report on experiments with online dictionaries or MySQL

Week 10: Nov 26-Dec 2
1. Work on Deliverable 5: CS297 Report

Week 11: Dec 3-Dec 9
1. Deliverable 5 due: CS297 Report

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Report on experiments with installing and running a Japanese text corpus

2. Report on standard Japanese grammar techniques for computers

3. Report on experiments with parsers and algorithm

4. Report on experiments with online dictionaries or MySQL

5. CS297 Report

References:

Kyoto University Text Corpus 4.0. Retrieved August 26, 2009, from http://www-lab25.kuee.kyoto-u.ac.jp/nl-resource/corpus-e.html

The Tanaka Corpus. Retrieved August 26, 2009, from http://www.csse.monash.edu.au/~jwb/tanakacorpus.html

Charniak, Eugene (1996). Statistical Language Learning. MIT Press.

Nagata, Masaaki, Saito, Teruka, and Suzuki, Kenji (2001). Using the web as a bilingual dictionary. Proceedings of the workshop on Data-driven methods in machine translation, p. 1-8, Toulouse, France.

Qu, Yan, Grefenstette, Gregory, and Evans, David (2003). Automatic Transliteration for Japanese-to-English Text Retrieval. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, p. 353-360, Toronto, Canada.

Japanese WordNet. Retrieved August 26, 2009, from http://nlpwww.nict.go.jp/wn-ja/index.en.html

WWWJDIC: Online Japanese Dictionary Service. Retrieved August 26, 2009, from http://www.edrdg.org/cgi-bin/wwwjdic/wwwjdic?1C

The EDICT Dictionary File. Retrieved August 26, 2009, from http://www.csse.monash.edu.au/~jwb/j_edict.html

Kanji. Retrieved August 26, 2009, from http://en.wikipedia.org/wiki/Kanji