Chris Pollett > Students >
Sujata

Deliverable 1

Description: This deliverable is a program that checks whether a given Japanese character/Kanji occurs in the Tanaka Corpus file. If the character is present, the program displays every line that contains it. The program is named 'kanji_character_search.py'.
The Tanaka Corpus was compiled by Professor Yasuhito Tanaka at Hyogo University and his students. For more detailed information about the Tanaka Corpus, please visit the following website:
Tanaka Corpus

Input: Japanese character/Kanji
Output: Lines containing entered Japanese character/Kanji
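
The search described above can be sketched in a few lines. This is a minimal illustration, not the code of 'kanji_character_search.py' itself: the function name find_lines is hypothetical, the corpus is assumed to be UTF-8 encoded, and the sketch is written for Python 3 rather than the Python 2.6.1 listed in the instructions.

```python
def find_lines(corpus_path, character):
    """Return every line of the corpus file that contains `character`.

    Assumption: the corpus file is UTF-8 encoded text.
    """
    matches = []
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            # A plain substring test is enough for a single character.
            if character in line:
                matches.append(line.rstrip("\n"))
    return matches


def report(corpus_path, character):
    """Print the matching lines, or the not-found message if there are none."""
    matches = find_lines(corpus_path, character)
    if matches:
        for line in matches:
            print(line)
    else:
        print("Character not found.")
```

Calling report("test_corpus.txt", "ス") would print each matching line in turn, and an empty result falls through to the "Character not found." message.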

Instructions to run the program:

  1. Python version 2.6.1
  2. Tanaka Corpus File
  3. $ ./kanji_character_search.py -f FILE -k KANJI, where FILE is the Tanaka Corpus filename and KANJI is the Japanese character or Kanji to search for

To run the program, provide two arguments: the name of the file to search and the Japanese character/Kanji to search for. If either argument is missing, the program displays a message asking the user to supply it and pointing to the -h help option.
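
The argument handling can be sketched as follows. The layout of the help text in the examples suggests Python's optparse module, but that is an assumption, as is the helper name parse_arguments; the option names and messages are taken from the example runs on this page.

```python
from optparse import OptionParser


def parse_arguments(argv):
    """Parse -f/--file and -k/--kanji from a list of arguments.

    Returns (options, error_message); error_message is None when both
    required arguments were supplied.
    """
    parser = OptionParser()
    parser.add_option("-f", "--file", dest="file", metavar="FILE",
                      help="corpus file FILE")
    parser.add_option("-k", "--kanji", dest="kanji", metavar="KANJI",
                      help="kanji character to search in corpus")
    options, _ = parser.parse_args(argv)

    # Check the corpus file first, then the character, so the user is
    # told about one missing argument at a time.
    if options.file is None:
        return options, "Please specify a corpus file. See -h for help."
    if options.kanji is None:
        return options, ("Please specify a kanji/character to search. "
                         "See -h for help.")
    return options, None
```

A script would typically call parse_arguments(sys.argv[1:]), print the error message if one is returned, and exit before attempting the search.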

The Tanaka Corpus file includes English as well as Japanese sentences. A second program, 'remove_english.py', extracts only the Japanese parsing sentences from the existing Tanaka Corpus file and writes them to a new file, so the new file contains only Japanese sentences.
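
One way such a filter might look is sketched below. It is not the code of 'remove_english.py'. It assumes the commonly distributed corpus layout, in which each sentence pair is stored on a line of the form "A: Japanese<TAB>English#ID=..." followed by a "B:" line of indexed Japanese words, and it takes "Japanese sentences" to mean the Japanese half of the "A:" lines. If the "B:" lines were intended instead, the filter would keep those. The function names are hypothetical.

```python
def extract_japanese(lines):
    """Yield the Japanese half of each 'A: Japanese<TAB>English' line.

    Assumption: sentence pairs are on lines starting with 'A: ' and the
    Japanese and English text are separated by a tab.
    """
    for line in lines:
        if not line.startswith("A: "):
            continue  # skip 'B:' lines and anything else
        japanese, _, _english = line[3:].partition("\t")
        yield japanese.strip()


def remove_english(source_path, target_path):
    """Write the Japanese sentences from source_path to target_path."""
    with open(source_path, encoding="utf-8") as source, \
            open(target_path, "w", encoding="utf-8") as target:
        for sentence in extract_japanese(source):
            target.write(sentence + "\n")
```

The resulting file has one Japanese sentence per line, which is the shape of input the search examples below operate on.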

Example of running kanji_character_search.py:

  • Woody:CS297 sujata$ ./kanji_character_search.py -f test_corpus.txt -k ス
    「17歳の時スクーナー船で地中海を航海したわ」彼女はゆっくりと注意深く言う。
    「1秒6ペンスだからね」とボブが念を押す。
    「4ポンド50ペンス」とボブが言う。

    In the above example, the Tanaka Corpus file is 'test_corpus.txt' and the Japanese character to search for is 'ス'. The output is every line that contains 'ス'.
  • Woody:CS297 sujata$ ./kanji_character_search.py -f test_corpus.txt -k 日
    Character not found.

    If the character is not found in the file, the message "Character not found." is shown to the user.
  • Woody:CS297 sujata$ ./kanji_character_search.py -h
    Usage: kanji_character_search.py [options]
    Options:
      -h, --help            show this help message and exit
      -f FILE, --file=FILE  corpus file FILE
      -k KANJI, --kanji=KANJI
                            kanji character to search in corpus

    The -h option displays help that explains the arguments.
  • Woody:CS297 sujata$ ./kanji_character_search.py
    Please specify a corpus file. See -h for help.

    An error message is displayed if the user does not supply the corpus filename.
  • Woody:CS297 sujata$ ./kanji_character_search.py -f test_corpus.txt
    Please specify a kanji/character to search. See -h for help.

    An error message is displayed if the user does not supply the Japanese character/Kanji to search for.

To download deliverable 1, click: Download Deliverable 1