CS298 Proposal

Japanese Kanji Suggestion Tool

Sujata Dongre (sujata.dongre@gmail.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun (rchun@cs.sjsu.edu) and Dr. Mark Stamp (stamp@cs.sjsu.edu)

Abstract:

When a misspelled search term is entered into a search engine such as Google, the engine often offers help of the form "Did you mean: ...". Similarly, my project aims to provide suggestions when a user enters incorrect Japanese text.

Japanese is written with three scripts: Hiragana, Katakana, and the most difficult, Kanji. In the past, Japanese script was written only vertically; horizontal writing is more common nowadays. A single Kanji may be used to write one or more different compound words. From the reader's point of view, Kanji are said to have one or more different "readings". Hence it is sometimes very difficult to recognize Kanji and to know how to read them, even if you know the Japanese language. Various translation tools are available nowadays that provide translation help. Well-known translation websites include:

Yahoo: http://babelfish.yahoo.com/translate_txt
Google: http://translate.google.com/#
ALC: http://www.alc.co.jp/

However, none of these tools provides good suggestions for incorrect Japanese words. In my project, I am developing a tool that helps users correct the Japanese words in their searches by giving them a list of correct Japanese words they might be looking for.

CS297 Results

  • Researched various Japanese language corpora, downloaded the Tanaka Corpus, and wrote a program to search for Japanese characters in the Tanaka Corpus
  • Learned the Hidden Markov Model (HMM) and how it works, along with the Viterbi, Forward Viterbi, Backward Viterbi, and HMM learning algorithms. These algorithms are the foundation of parsing Japanese language text
  • Wrote programs for the Viterbi, Forward Viterbi, and Backward Viterbi algorithms
  • Researched full-text search techniques in MySQL for the Japanese language. An N-gram full-text parser plugin is available for MySQL 5.1 and later. Installed the N-gram parser and tested full-text search for Japanese
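To illustrate the Viterbi algorithm mentioned above, here is a minimal sketch in Python. It is not the program written for CS297. The function name `viterbi` and the toy states and probabilities are assumptions chosen for illustration (they follow the well-known two-state textbook example), and the sketch multiplies raw probabilities, so it is only suitable for short observation sequences.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Return the most probable hidden state path and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               state preceding s on that path)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        row = {}
        for s in states:
            row[s] = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
        best.append(row)

    # Pick the best final state, then follow the back-pointers.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, prob


# Toy model (illustrative numbers only, not taken from the project).
STATES = ["Healthy", "Fever"]
START = {"Healthy": 0.6, "Fever": 0.4}
TRANS = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
EMIT = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

if __name__ == "__main__":
    print(viterbi(["normal", "cold", "dizzy"], STATES, START, TRANS, EMIT))
```

For Japanese text, the hidden states could plausibly mark word boundaries and the observations could be characters, but that mapping is a design choice left open here.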

Proposed Schedule

Week 1: (27 Jan-3 Feb) Write a program for the HMM training algorithm. Test the program with simple English text.
Week 2-3: (4 Feb-18 Feb) Write a program for extracting the Japanese words from the Tanaka text corpus file.
Week 4-6: (19 Feb-12 Mar) Modify the above HMM training program for Japanese words, if required.
Week 7: (13 Mar-19 Mar) Test the HMM training program on Japanese text.
Week 8: (20 Mar-26 Mar) Deliverable 1: Japanese word segmenter program.
Week 9: (27 Mar-2 Apr) Search for and download the existing search engine APIs.
Week 10-11: (3 Apr-17 Apr) Deliverable 2: Update the existing search engine APIs.
Week 12: (19 Apr-26 Apr) Prepare the CS298 final report.
Week 13: (27 Apr-5 May) Complete the draft report for committee review; prepare project presentation slides.
Week 13: (27 Apr-5 May) Defense in front of the committee.
September 24: Viterbi program modifications should be done.
October 8: Search with the Tanaka Corpus file on the command line should be done.
October 22: Testing with search, different tests should be done.
October 26: Start writing the CS298 report, prepare slides.
November 16/23: Defense.

Key Deliverables:

  • Software
    • Deliverable_1: Develop a Japanese word segmenter using the Hidden Markov Model. This deliverable is divided into the following sub-deliverables.
      • Write a program for the HMM training algorithm. Test this program with English text.
      • Write a program that extracts all the Japanese words from the corpus text file and counts the number of occurrences of each word.
      • Test the output of the above program by giving it a randomly selected substring as input and checking whether the result matches the Japanese words containing that substring.
    • Deliverable_2: Download and test the existing search engine APIs. This deliverable mainly involves experimenting with and testing how the existing search engine APIs work. It is divided into the following sub-deliverables.
      • Download the source code for the existing search engine APIs and become familiar with it.
      • Modify the existing source code of the API as required by the project.
  • Report
    • A final report will be delivered, consisting of a detailed description of the software used and the algorithms implemented in the project. The report will also cover prior work in this field and explain how this project is innovative and different from existing work. It will also include a detailed explanation of all the experiments conducted and the results produced.
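The word extraction and substring test described under Deliverable_1 could be sketched roughly as follows. This is a hedged illustration, not the project's code. It assumes a simplified view of the corpus in which lines beginning with `B: ` list space-separated words, optionally followed by annotations in `()`, `[]`, or `{}` or by a trailing `~`. The function names `count_words` and `words_containing` and the sample lines are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed annotation syntax: readings in (), sense numbers in [],
# surface forms in {}, and a trailing ~ marker. Adjust to the real file.
ANNOTATION = re.compile(r"\(.*?\)|\[.*?\]|\{.*?\}|~")


def count_words(lines: Iterable[str]) -> Counter:
    """Count how often each word occurs on the 'B: ' lines."""
    counts: Counter = Counter()
    for line in lines:
        if not line.startswith("B: "):
            continue  # skip sentence lines and anything else
        for token in line[3:].split():
            word = ANNOTATION.sub("", token)
            if word:
                counts[word] += 1
    return counts


def words_containing(counts: Counter, substring: str) -> List[Tuple[str, int]]:
    """Return (word, count) pairs whose word contains the substring,
    most frequent first, ties broken by code point order."""
    matches = [(w, c) for w, c in counts.items() if substring in w]
    return sorted(matches, key=lambda wc: (-wc[1], wc[0]))


# Hypothetical sample in the assumed format (not real corpus entries).
SAMPLE = [
    "A: 私は学生です。\tI am a student.#ID=1",
    "B: 私(わたし) は 学生 です",
    "A: 学校に行く。\tI go to school.#ID=2",
    "B: 学校 に 行く",
    "B: 学生 は 学校 に 行く{行きます}",
]

if __name__ == "__main__":
    word_counts = count_words(SAMPLE)
    for word, count in words_containing(word_counts, "学"):
        print(word, count)
```

Sorting the matches by frequency is one possible way to rank suggestions; the proposal itself does not specify a ranking.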

Innovations and Challenges

  • Developing a program for the HMM training algorithm that works on a Japanese language text corpus.
  • Understanding how the existing search engine APIs can be modified for the project.
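As a rough illustration of the first challenge, the sketch below implements one common HMM training procedure, the Baum-Welch (forward-backward) re-estimation, and runs it on simple English text. The proposal does not state which training algorithm will be used, so this choice is an assumption, as are all function names and the deterministic starting values. No scaling is applied, so the sketch only works for short sequences.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Model = Tuple[List[float], Matrix, Matrix]  # (start, transition, emission)


def forward(obs: Sequence[int], start, trans, emit) -> Matrix:
    n = len(start)
    alpha = [[start[i] * emit[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * trans[i][j] for i in range(n))
                      * emit[j][obs[t]] for j in range(n)])
    return alpha


def backward(obs: Sequence[int], trans, emit) -> Matrix:
    n = len(trans)
    beta = [[1.0] * n]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][obs[t + 1]] * nxt[j]
                            for j in range(n)) for i in range(n)])
    return beta


def likelihood(obs: Sequence[int], start, trans, emit) -> float:
    return sum(forward(obs, start, trans, emit)[-1])


def baum_welch_step(obs: Sequence[int], start, trans, emit) -> Model:
    """One re-estimation step on a single observation sequence."""
    n, m, T = len(start), len(emit[0]), len(obs)
    alpha = forward(obs, start, trans, emit)
    beta = backward(obs, trans, emit)
    total = sum(alpha[-1])
    # gamma[t][i]: probability of being in state i at time t
    gamma = [[alpha[t][i] * beta[t][i] / total for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of moving from i at t to j at t + 1
    xi = [[[alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]]
            * beta[t + 1][j] / total for j in range(n)]
           for i in range(n)] for t in range(T - 1)]
    new_start = gamma[0][:]
    new_trans = [[sum(xi[t][i][j] for t in range(T - 1))
                  / sum(gamma[t][i] for t in range(T - 1))
                  for j in range(n)] for i in range(n)]
    new_emit = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
                 / sum(gamma[t][i] for t in range(T))
                 for k in range(m)] for i in range(n)]
    return new_start, new_trans, new_emit


def initial_model(n_states: int, n_symbols: int) -> Model:
    """Deterministic, slightly non-uniform start (an arbitrary choice)."""
    def norm(row):
        s = sum(row)
        return [x / s for x in row]
    start = norm([1.0 + 0.1 * i for i in range(n_states)])
    trans = [norm([1.0 + 0.1 * ((i + 2 * j) % 3) for j in range(n_states)])
             for i in range(n_states)]
    emit = [norm([1.0 + 0.5 * (((i + 1) * (k + 1)) % 4)
                  for k in range(n_symbols)]) for i in range(n_states)]
    return start, trans, emit


def train(text: str, n_states: int = 2, iterations: int = 5):
    """Train on the characters of text; return the model and the
    likelihood after each iteration."""
    index = {ch: k for k, ch in enumerate(sorted(set(text)))}
    obs = [index[ch] for ch in text]
    model = initial_model(n_states, len(index))
    history = [likelihood(obs, *model)]
    for _ in range(iterations):
        model = baum_welch_step(obs, *model)
        history.append(likelihood(obs, *model))
    return model, history


if __name__ == "__main__":
    _, hist = train("the quick brown fox jumps over the lazy dog")
    print(hist)
```

A useful sanity check when testing such a program is that the likelihood of the training text should not decrease from one iteration to the next. For a full corpus, the probabilities would need scaling or log-space arithmetic to avoid underflow.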

References:

The Tanaka Corpus. Retrieved August 26, 2009, from http://www.csse.monash.edu.au/~jwb/tanakacorpus.html

Eugene Charniak. 1996. Statistical Language Learning. MIT Press.

Viterbi algorithm. Retrieved November 4, 2009, from http://en.wikipedia.org/wiki/Viterbi_algorithm

MySQL full-text parser plugin collection. Retrieved December 4, 2009, from http://sourceforge.net/apps/mediawiki/mysqlftppc/index.php?title=Main_Page

Constantine P. Papageorgiou. 1994. Japanese word segmentation by hidden Markov model. In Proc. of the HLT Workshop, pages 283-288.