Sujata

CS297-298 Project News Feed

Meeting Date | Discussion Topics | Suggestions | TO DO
Nov 16, 2010 | Implement HMM program in Nutch web crawler | Dr. Pollett reviewed the first draft of the CS298 report and suggested changes
  • Incorporate HMM program into Nutch
  • Finish first draft of CS298 report
  • Write program to save HMM final matrix into file and use that in all experiment programs
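One way to save the final HMM matrix and reuse it across experiment programs is sketched below. It is only an illustration: the class name `MatrixStore`, the plain-text format (one row per line, values separated by spaces), and the use of Java 7 or later are assumptions, not the format used in the project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: stores a trained HMM matrix as plain text, one row per line. */
class MatrixStore {

    /** Writes each row as space-separated doubles. */
    static void save(double[][] matrix, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (double[] row : matrix) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < row.length; j++) {
                if (j > 0) sb.append(' ');
                // Double.toString output is read back exactly by Double.parseDouble
                sb.append(Double.toString(row[j]));
            }
            lines.add(sb.toString());
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    /** Reads a matrix written by save(); assumes every row has at least one value. */
    static double[][] load(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        double[][] matrix = new double[lines.size()][];
        for (int i = 0; i < lines.size(); i++) {
            String[] parts = lines.get(i).trim().split("\\s+");
            matrix[i] = new double[parts.length];
            for (int j = 0; j < parts.length; j++) {
                matrix[i][j] = Double.parseDouble(parts[j]);
            }
        }
        return matrix;
    }
}
```

Each experiment program could then call `MatrixStore.load(...)` on the same file instead of retraining the HMM.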
Nov 9, 2010 | Implement HMM program in Nutch web crawler | Dr. Pollett suggested a few changes for adding the HMM program to the Nutch web crawler
  • Incorporate HMM program into Nutch
  • Start writing experiments and report
  • Write program to save HMM final matrix into file and use that in all experiment programs
Nov 2, 2010 | Discuss Nutch web crawler | Dr. Pollett suggested a few changes to make the Nutch web crawler work
  • Try running Nutch web crawler on the Windows platform
  • Start writing experiments and report
  • Write program to save HMM final matrix into file and use that in all experiment programs
Oct 26, 2010 | Discuss how to merge all experiments and a front end to show results | Download Nutch web crawler. Dr. Pollett asked me to start writing experiments and report
  • Download Nutch web crawler
  • Start writing experiments and report
  • Write program to save HMM final matrix into file and use that in all experiment programs
  • Fixed the experiment 2 bug. All experiments are working now
Oct 19, 2010 | Make changes in the experiments | Dr. Pollett suggested a few experiments using the n-gram approach
  • Work on experiments
  • Experiment 1: Sort the binary search tree entries by the number of occurrences of each word
  • Experiment 2: Make changes in the code. Assume the user enters two characters C1 and C2. If the count of that pair is positive, pass it to the HMM to find out what C3 would be; otherwise ignore that string
  • Experiment 3: done. How to make changes in this program?
  • Read about existing crawlers; download and experiment with them
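Experiment 1 (word as key, occurrence count as value, then ordering by count) could look roughly like the sketch below. It assumes Java 8 or later and uses `java.util.TreeMap`, which is a balanced binary search tree keyed by word; the class name `WordCounts` and the assumption that the corpus has already been split into words are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for Experiment 1: word counts kept in a tree, then ordered by count. */
class WordCounts {

    /** Counts words in a TreeMap (a red-black tree, i.e. a balanced binary search tree). */
    static TreeMap<String, Integer> count(List<String> words) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            Integer old = counts.get(w);
            counts.put(w, old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** Returns the entries ordered by descending count; ties keep the tree's key order. */
    static List<Map.Entry<String, Integer>> byOccurrences(TreeMap<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts stay in key order
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries;
    }
}
```

The tree itself stays ordered by word; the by-count ordering is produced as a separate list because a search tree can only be ordered by its key.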
Oct 12, 2010 | Make changes in the experiments | Dr. Pollett suggested a few experiments using the n-gram approach
  • Work on experiments
  • Experiment 1: Make changes in the code. Use a binary search tree to store each word as the key and its number of occurrences in the corpus file as the value
  • Experiment 2: Make changes in the code. Use a substring (array slice) of length K = 2
  • Experiment 3: done
  • Read about existing crawlers; download and experiment with them
Oct 5, 2010 | Different experiments with HMM Japanese parser and Tanaka Corpus | Dr. Pollett suggested a few experiments using the n-gram approach
  • Work on experiments
  • Experiment 1: Read characters from the Tanaka Corpus file using a window size of 2. This experiment is useful for suggesting the next character of the user's input string
  • Experiment 2: Read characters from the Tanaka Corpus file, adding 1 to a count if a special character is found and subtracting 1 otherwise. This experiment is useful for detecting the end of a word
  • Experiment 3: Read characters from the Tanaka Corpus file until a special character is found. This experiment is useful for building a dictionary of Japanese words from the corpus file
  • Read about existing crawlers; download and experiment with them
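The window-size-2 idea in Experiment 1 can be sketched as a character bigram count: for every character, remember how often each other character follows it, then suggest the most frequent follower. The class name `BigramModel` and its methods are hypothetical, and the sketch assumes Java 8 or later. It works on Unicode code points so that any Japanese character counts as one unit.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bigram (window size 2) model over Unicode code points. */
class BigramModel {
    // counts.get(c1).get(c2) = how often code point c2 directly follows c1
    private final Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();

    /** Slides a window of size 2 over the line and counts each adjacent pair. */
    void addLine(String line) {
        int[] cps = line.codePoints().toArray();
        for (int i = 0; i + 1 < cps.length; i++) {
            counts.computeIfAbsent(cps[i], k -> new HashMap<>())
                  .merge(cps[i + 1], 1, Integer::sum);
        }
    }

    /** Returns the most frequent successor of c1 as a string, or "" if c1 has no recorded successor. */
    String suggestNext(int c1) {
        Map<Integer, Integer> next = counts.get(c1);
        if (next == null) return "";
        int best = -1;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> e : next.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return new String(Character.toChars(best));
    }
}
```

Feeding every line of the corpus file to `addLine` and then calling `suggestNext` with the last character typed by the user gives a simple next-character suggestion. When two followers are equally frequent, which one is returned is unspecified in this sketch.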
Sept 28, 2010 | Different experiments and making the program more flexible for user input | Dr. Pollett suggested a few experiments, one using the HMM and the other using the n-gram approach
  • Use generics as per JDK1.6 standards
  • Modify program for first experiment: Accept user input and search for the string in the Tanaka Corpus
  • Modify program for second experiment: Accept user input, replace some of the characters, and display the suggested string
  • Assuming window size of 2, experiment with n-gram
Sept 21, 2010 | How to check which character to choose to show to the user | Store probabilities in an array, sort the array, and print it. Depending on the probabilities, choose the letters to show to the user as the suggested string. Use the Tanaka Corpus to decide which one is better
  • Add sorted array functionality
  • Print sentences from the Tanaka Corpus file that contain the highest-probability string
Sept 14, 2010 | Modifications in Viterbi program | Modify the Viterbi program to accept input from the user. Append 191 Japanese characters at the beginning, in the middle, and at the end of the user input and run the Viterbi program on all these combinations. Get the string with the highest probability from the Viterbi program and print sentences containing this string from the Tanaka Corpus file using the command line
  • Modify Viterbi program as suggested
  • Print sentences from the Tanaka Corpus file that contain the highest-probability string
  • Upload sub-deliverables
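The candidate-generation step described for Sept 14 could be sketched as follows. This only builds the combinations; scoring each one is left to the Viterbi program. The class name `Candidates`, taking "the middle" to mean index `length / 2`, and passing the candidate characters as a single string are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds the strings that would be handed to the Viterbi program. */
class Candidates {

    /** Inserts each alphabet character at the start, the middle (index length/2), and the end of the input. */
    static List<String> expand(String input, String alphabet) {
        List<String> out = new ArrayList<>();
        int mid = input.length() / 2;
        for (int i = 0; i < alphabet.length(); i++) {
            char c = alphabet.charAt(i); // assumes characters in the Basic Multilingual Plane, which holds for kana
            out.add(c + input);
            out.add(input.substring(0, mid) + c + input.substring(mid));
            out.add(input + c);
        }
        return out;
    }
}
```

With an alphabet of 191 characters this yields 573 candidate strings per user input, and the one with the highest Viterbi probability would then be looked up in the corpus file.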
Sept 7, 2010 | Viterbi program | With the given HMM for Japanese language parsing, check the highest probability and the path with the highest probability
  • Work on Viterbi program; check that it outputs the correct probability and path
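For checking the probability and path by hand, a compact reference version of the Viterbi algorithm can help. The sketch below is a generic textbook formulation in log space, not the project's program; the class name `Viterbi`, the parameter names, and the integer encoding of states and observations are assumptions.

```java
/** Hypothetical reference implementation of the Viterbi algorithm in log space. */
class Viterbi {

    /** Best state path and its log probability. */
    static final class Result {
        final int[] path;
        final double logProb;
        Result(int[] path, double logProb) { this.path = path; this.logProb = logProb; }
    }

    /**
     * pi[i]: initial probability of state i; a[i][j]: transition i -> j;
     * b[i][k]: probability that state i emits symbol k; obs: non-empty symbol sequence.
     */
    static Result decode(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length;
        int len = obs.length;
        double[][] score = new double[len][n]; // best log probability of a path ending in state j at time t
        int[][] back = new int[len][n];        // predecessor state on that best path
        for (int i = 0; i < n; i++) {
            score[0][i] = Math.log(pi[i]) + Math.log(b[i][obs[0]]);
        }
        for (int t = 1; t < len; t++) {
            for (int j = 0; j < n; j++) {
                int arg = 0;
                double best = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < n; i++) {
                    double v = score[t - 1][i] + Math.log(a[i][j]);
                    if (v > best) { best = v; arg = i; }
                }
                score[t][j] = best + Math.log(b[j][obs[t]]);
                back[t][j] = arg;
            }
        }
        int last = 0;
        for (int i = 1; i < n; i++) {
            if (score[len - 1][i] > score[len - 1][last]) last = i;
        }
        int[] path = new int[len];
        path[len - 1] = last;
        for (int t = len - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return new Result(path, score[len - 1][last]);
    }
}
```

As a small hand check, take two states with `pi = {0.6, 0.4}`, `a = {{0.7, 0.3}, {0.4, 0.6}}`, `b = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}}` and observations `{0, 1, 2}`. The best path is `{0, 0, 1}` with probability 0.6 · 0.5 · 0.7 · 0.4 · 0.3 · 0.6 = 0.01512.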
Aug 31, 2010 | Analyzing probabilities after running HMM on Japanese text corpus | Come up with some rules for state transitions after analyzing the probabilities. Use those rules to detect word boundaries
  • With the given HMM and user input string, write a program that will output the string with the highest probability
May 12, 2010 | Discuss analyzing the Japanese corpus file and future work | Work on Japanese characters in HMM training algorithm
  • Complete HMM training program for Japanese characters
May 5, 2010 | Discuss analyzing the Japanese corpus file | Update HMM training program for Japanese characters
  • Read Japanese characters from the corpus file. Assign each hiragana character its own observation number. Assign each katakana character its own observation number. Assign all kanji to a single observation
  • Yahoo search API code is working
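The observation assignment described above (one number per hiragana, one per katakana, a single shared number for all kanji) might be sketched with `Character.UnicodeBlock`. The concrete numbering, the class name `ObservationMapper`, and the handling of other characters are assumptions for illustration; the project's actual numbering is not given in these notes.

```java
/** Hypothetical mapping from a Unicode code point to an HMM observation symbol. */
class ObservationMapper {
    static final int KANJI = 192; // single shared symbol for every kanji (assumed numbering)
    static final int OTHER = -1;  // anything else: ASCII, punctuation, and so on

    static int toObservation(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        if (block == Character.UnicodeBlock.HIRAGANA) {
            return codePoint - 0x3040;        // Hiragana block U+3040..U+309F -> 0..95
        }
        if (block == Character.UnicodeBlock.KATAKANA) {
            return 96 + (codePoint - 0x30A0); // Katakana block U+30A0..U+30FF -> 96..191
        }
        if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
            return KANJI;
        }
        return OTHER;
    }
}
```

Collapsing all kanji into one symbol keeps the emission matrix small, at the cost of the model not being able to tell individual kanji apart.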
Apr 28, 2010 | Discuss the HMM training program and how to implement search functionality | HMM training program is working now
  • Read Japanese characters from the corpus file. Assign each hiragana character its own observation number. Assign each katakana character its own observation number. Assign all kanji to a single observation
  • Implement search functionality using Google search APIs or Yahoo search APIs
Apr 21, 2010 | Discuss the HMM training program and how to implement search functionality | HMM training program is working now
  • Change HMM training program for Japanese corpus
  • Implement search functionality using Google search APIs or Yahoo search APIs
Apr 14, 2010 | Discuss the tested HMM training program | We found that the total number of characters is less than 50000 in the file read and write code
  • Replicate the results from Dr. Mark Stamp's paper by hard coding initial probabilities
  • Use these probabilities to check whether the HMM converges in just the first iteration
Apr 7, 2010 | Discuss the tested HMM training program | Dr. Pollett checked the HMM training algorithm program, including the reading and writing of the observation sequence file. We debugged the program and found that there might be a problem with the file read and write code
  • Test file reading and writing code
Mar 24, 2010 | Discuss the log probabilities in the HMM training program | Dr. Pollett checked the HMM training algorithm program and suggested debugging the code for just 2 iterations. He asked me to calculate all the probabilities manually to make sure the program is bug free
  • Check for bugs in the program by running it for two iterations
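One quantity that is easy to compare against a manual calculation is log P(O | model), computed with the scaled forward pass. The sketch below is a generic formulation, not the project's training code; the class name `ForwardLogProb` and the parameter layout are assumptions.

```java
/** Hypothetical helper: log-likelihood of an observation sequence via the scaled forward pass. */
class ForwardLogProb {

    /**
     * pi[i]: initial probability of state i; a[i][j]: transition i -> j;
     * b[i][k]: probability that state i emits symbol k. Returns log P(obs | model).
     */
    static double logProb(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length;
        double[] alpha = new double[n];
        double logProb = 0.0;
        for (int t = 0; t < obs.length; t++) {
            double[] next = new double[n];
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                double in;
                if (t == 0) {
                    in = pi[j];
                } else {
                    in = 0.0;
                    for (int i = 0; i < n; i++) {
                        in += alpha[i] * a[i][j];
                    }
                }
                next[j] = in * b[j][obs[t]];
                sum += next[j];
            }
            // Rescale so the values sum to 1; this prevents underflow on long sequences
            for (int j = 0; j < n; j++) {
                next[j] /= sum;
            }
            // The product of the scaling sums equals P(obs), so their logs add up to log P(obs)
            logProb += Math.log(sum);
            alpha = next;
        }
        return logProb;
    }
}
```

For example, with `pi = {0.6, 0.4}`, `a = {{0.7, 0.3}, {0.4, 0.6}}`, `b = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}}` and observations `{0, 1, 2}`, the unscaled forward values are (0.3, 0.04), then (0.0904, 0.0342), then (0.007696, 0.028584), so P(O) = 0.03628 and the function should return log 0.03628. During training this value should not decrease from one iteration to the next, which makes it a convenient check over two iterations.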
Mar 17, 2010 | Discuss the program for the HMM training algorithm | Dr. Pollett checked the HMM training algorithm program and suggested a few changes based on Dr. Mark Stamp's paper
  • Check for bugs in the program and try to make it work for English text using the Brown Corpus file
Mar 10, 2010 | Discuss the program for the HMM training algorithm | Dr. Pollett checked the HMM training algorithm program and suggested a few changes based on Dr. Mark Stamp's paper
  • Make changes in HMM training algorithm program referring to Dr. Mark Stamp's paper
Mar 3, 2010 | Discuss the program for the HMM training algorithm | Dr. Pollett suggested meeting Professor Stamp to get his advice on the HMM training algorithm
  • Make changes in HMM training program for calculating count C and hence the probability
  • Professor Mark Stamp asked me to read his paper about HMMs. He also suggested thinking about the total number of characters to consider for the HMM training algorithm
Feb 24, 2010 | Discuss the program for the HMM training algorithm | Dr. Pollett suggested a few changes in the program for calculating transition probabilities
  • Make changes in HMM training program for calculating count C and hence the probability
Feb 17, 2010 | Discuss the program for the HMM training algorithm | Dr. Pollett suggested a few changes in the program for calculating transition probabilities
  • Make changes in HMM training program for calculating count C and hence the probability
Feb 10, 2010 | Discuss the program for the HMM training algorithm | Dr. Pollett explained the steps for implementing the HMM training algorithm in detail
  • Complete the HMM training program by implementing the steps
Feb 3, 2010 | Discuss the program for the HMM training algorithm | Dr. Pollett explained the steps for implementing the HMM training algorithm. He also reviewed the program I had written
  • Complete the HMM training program by implementing the steps
Jan 27, 2010 | Decide the meeting time for CS298 |
  • Start working on HMM training algorithm
Dec 1, 2009 | Discuss all the deliverables and the CS297 report | Dr. Pollett reviewed all the deliverables and suggested a few changes in some of the PDF files
  • Make changes in deliverable 2
  • Print the CS297 report and submit it
  • Check that all the pages validate as XHTML 1.1. Also do a full check using Acrobat Pro
  • Prepare CS298 proposal and decide on committee members
Nov 24, 2009 | Discuss deliverable 4 | Dr. Pollett suggested a few changes to resolve the installation errors of the MySQL N-gram parser plugin
  • Prepare deliverable 4 for MySQL N-gram parser installation experiment.
  • Prepare CS297 Report.
Nov 17, 2009 | Discuss HMM training | Dr. Pollett explained the HMM training algorithm to me.
  • Add probability calculation table for HMM training in deliverable 2
  • Experiment with MySQL full text plugin for Japanese language
Nov 10, 2009 | Discuss deliverable 3 and deliverable 4 | Dr. Pollett verified the HMM example from deliverable 2. He asked me to add the HMM learning algorithm to deliverable 3 and start work on deliverable 4.
  • Make slides/PDF file for HMM learning algorithm: deliverable 3 changes
  • Search for MySQL full text search for Japanese language: deliverable 4
Nov 2, 2009 | Discuss deliverable 3 | I asked a few questions about the HMM example in deliverable 2. Dr. Pollett explained the transition probabilities and emission probabilities to me, and why the emission probabilities need to be considered.
  • Make changes in deliverable 2 HMM example.
  • Understand HMM learning/training.
  • Upload deliverable 3: programs for the Viterbi and Forward Viterbi algorithms.
Oct 27, 2009 | Discuss deliverable 3 | Dr. Pollett suggested some changes in the example for the Viterbi algorithm. He also asked me to understand the HMM training algorithm and write programs for the Viterbi and Forward Viterbi algorithms
  • Make changes in deliverable 2 report.
  • Understand HMM learning/training.
  • Write programs for Viterbi and Forward Viterbi algorithms.
Oct 20, 2009 | Finalize contents of deliverable 2 and discuss deliverable 3 | Dr. Pollett reviewed the example for HMM and the Viterbi algorithm. He explained to me the difference between the Viterbi and Forward Viterbi algorithms.
  • Make changes in deliverable 2 report.
  • Understand HMM learning/training.
  • Write programs for Viterbi and Forward Viterbi algorithms.
Oct 6, 2009 | Progress on deliverable 2 and discuss algorithms for parsing Japanese text | Dr. Pollett suggested a few changes in the HMM report. He asked me to explain HMMs and the Viterbi algorithm with my own example. He then suggested that I start working on deliverable 3 by understanding the Viterbi algorithm and writing a program for it.
  • Make changes in deliverable 2 report.
  • Prepare slides for chapters 2, 3, and 4 from SLP.
  • Start working on deliverable 3: Program for Viterbi algorithm.
Sept 29, 2009 | Progress on deliverable 2 and Japanese parsing techniques | Two parsers used for Japanese text are the Chasen Morphological Analyzer and MeCab. Chasen is based on the Hidden Markov Model and MeCab is based on CRFs. Dr. Pollett explained some NLP concepts to me, such as entropy. He asked me to read about HMMs and CRFs and understand what they are and why and how they work.
  • Read the second and third chapters of SLP and prepare slides.
  • Read about HMMs and CRFs and get a better understanding of them.
  • Write a report on HMMs and CRFs.
  • Work on Deliverable 2
Sept 15, 2009 | Progress on deliverable 1 and Japanese parsing techniques | Dr. Pollett suggested a few changes in the Theory of Computing slides. He also suggested a few approaches for developing a program to remove English sentences from the Tanaka Corpus file. Then we discussed deliverable 2. Dr. Pollett asked me to find out which techniques are used for parsing Japanese text.
  • Post the new program for removing English sentences from the Tanaka Corpus file.
  • Update Theory of Computing slides with examples.
  • Read the second chapter of SLP and prepare slides
  • Describe the Tanaka Corpus in more detail in deliverable 1. Make changes in the link tag.
  • Find out which techniques are used for parsing Japanese text.
  • Work on Deliverable 2
Sept 8, 2009 | Japanese Corpus | Dr. Pollett asked me to read the first chapter of the book SLP and prepare slides on it. He also gave me a Theory of Computing book for understanding finite automata concepts. After that we discussed the Kyoto Text Corpus and the Tanaka Text Corpus. It was not possible to try out the Kyoto Text Corpus because it requires purchasing a CD. The Tanaka Text Corpus works fine. Dr. Pollett suggested that I make changes in the existing Tanaka Text Corpus file and write a program that takes a kanji as input and displays all the lines from the file containing that kanji character.
  • Write a program: input = kanji or any Japanese character, output = lines containing that kanji/character.
  • Read the first chapter of SLP and prepare slides
  • Read some sections from Theory of Computing book and prepare slides
  • Work on Deliverable 1
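The lookup program described above (a Japanese character in, matching corpus lines out) could be sketched as below. The class name `CorpusSearch`, the method names, and the assumption that the corpus file is stored as UTF-8 are illustrative; the corpus may need a different charset in practice.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds corpus lines that contain a given character or string. */
class CorpusSearch {

    /** Returns the lines that contain the query, in their original order. */
    static List<String> linesContaining(List<String> lines, String query) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (line.contains(query)) {
                hits.add(line);
            }
        }
        return hits;
    }

    /** Reads the whole file (assumed to be UTF-8) and filters it. */
    static List<String> search(Path corpusFile, String query) throws IOException {
        return linesContaining(Files.readAllLines(corpusFile, StandardCharsets.UTF_8), query);
    }
}
```

Because the query is a `String` rather than a single `char`, the same method also handles multi-character input such as a whole word.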
Aug 26, 2009 | Initial proposal | Dr. Pollett suggested a few changes in the initial proposal. He reviewed the description of the project and suggested a few changes to the purpose of the project. Dr. Pollett also asked me to refer to the Statistical Language Processing book and some Japanese corpora.
  • Make required changes in the proposal and submit the final copy to the CS department
  • Start reading the Statistical Language Learning book
  • Search for Japanese corpora