CS297 Proposal

Improved Chinese Language Processing for an Open Source Search Engine

Xianghong "Forrest" Sun (xianghong.sun@sjsu.edu)

Advisor: Dr. Chris Pollett

Description:

Yioop is an open source search engine. It supports many languages and can process and analyze text. Yioop's current Chinese text segmentation system uses backward maximum matching. However, backward maximum matching has a lower hit rate than more recent algorithms. My project goal is to implement better algorithms for analyzing and extracting Chinese words from text.
Here is what I plan to implement this semester:
I will implement a Chinese text segmentation algorithm that extracts Chinese words from a sentence.
I will implement an algorithm for part-of-speech (POS) tagging.
I will implement a Chinese question answering system.
If time permits, I will implement a named entity recognition algorithm to support Chinese word segmentation.
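
For context, backward maximum matching can be sketched as follows. This is a minimal illustration, not Yioop's actual implementation: the function name, the `max_word_len` parameter, and the toy dictionary are all assumptions made for the example. The idea is to scan the text from right to left and, at each position, take the longest dictionary word that ends there, falling back to a single character when nothing matches.

```python
def backward_maximum_matching(text, dictionary, max_word_len=4):
    """Segment text right to left, preferring the longest dictionary match.

    `dictionary` is any container supporting `in` (a set here) and
    `max_word_len` caps the candidate length; both are illustrative.
    """
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_word_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    # Words were collected from the end of the text, so restore the order.
    words.reverse()
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(backward_maximum_matching("研究生命的起源", toy_dictionary))
# prints ['研究', '生命', '的', '起源']
```

Because the method is purely greedy and dictionary-driven, it cannot handle words missing from the dictionary or resolve ambiguities from context, which is what statistical approaches aim to improve on.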

Schedule:

Week 1 (Aug 28 - Sep 3): First meeting and write up proposal draft
Week 2 (Sep 4 - Sep 10): Look into the related parts of the Yioop source code and documentation
Week 3 (Sep 11 - Sep 17): Fix a bug where Chinese stop words are not removed correctly
Week 4 (Sep 18 - Sep 24): Deliverable 1: Chinese word suggestion system on Yioop
Week 5 (Sep 25 - Oct 1): Read related documentation on Chinese word segmentation (conditional random field (CRF) model)
Week 6 (Oct 2 - Oct 8): Continue
Week 7 (Oct 9 - Oct 15): Deliverable 2: Word segmentation algorithm
Week 8 (Oct 16 - Oct 22): Read related documentation on part-of-speech tagging for Chinese
Week 9 (Oct 23 - Oct 29): Continue
Week 10 (Oct 30 - Nov 5): Deliverable 3: POS tagging
Week 11 (Nov 6 - Nov 12): Read related documentation on question answering systems
Week 12 (Nov 13 - Nov 19): Continue
Week 13 (Nov 20 - Nov 26): Deliverable 4: QA system
Week 14 (Nov 27 - Dec 3): Work on report
Week 15 (Dec 4 - Dec 10): CS 297 report
Week 16 (Dec 11 - Dec 18): Report due

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Implement a dictionary-based word suggestion system on Yioop.

2. Implement a Chinese word segmentation algorithm.

3. Implement a part-of-speech tagging system.

4. Implement a question answering system.

5. Final CS 297 Report.
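
As a rough illustration of how a CRF-based segmenter (Deliverable 2) could produce words, segmentation can be treated as character tagging: each character receives a label and the labels are decoded into word boundaries. The sketch below assumes a four-tag scheme (B = begin, M = middle, E = end, S = single-character word); the proposal does not fix a tag set, so this scheme and the function name `tags_to_words` are assumptions. In a real system the tags would be predicted by a trained model rather than written by hand.

```python
def tags_to_words(chars, tags):
    """Decode per-character B/M/E/S tags into a list of words.

    The tag scheme is an assumption made for this sketch; a trained
    sequence model (for example a CRF) would supply `tags`.
    """
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            # A new word starts, so flush any unfinished word first.
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            # The current word is complete.
            words.append(current)
            current = ""
    if current:
        # Tolerate a tag sequence that ends mid-word.
        words.append(current)
    return words


# Hand-written tags, for illustration only.
print(tags_to_words("我喜欢搜索引擎", "SBEBEBE"))
# prints ['我', '喜欢', '搜索', '引擎']
```

Unlike dictionary matching, the quality of this approach depends on the tagger, which can learn from context and so has a chance of segmenting words that no dictionary lists.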

References:

Chinese Natural Language Processing and Speech Processing by The Stanford Natural Language Processing Group

Conditional Random Fields:

John Lafferty, Andrew McCallum, and Fernando C.N. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", June 2001.

Charles Sutton and Andrew McCallum, "An Introduction to Conditional Random Fields", MIT Press, 2006.

Chinese Segmentation:

Sproat, Richard & Shih, Chilin & Gale, William & Chang, Nancy. (2002). A Stochastic Finite-State Word-Segmentation Algorithm For Chinese. Computational Linguistics. 22. 10.3115/981732.981742.

Peng, Fuchun, "Chinese Segmentation and New Word Detection using Conditional Random Fields". Computer Science Department Faculty Publication Series. 92. 2004

Li, Shoushan & Huang, Chu-Ren. Word Boundary Decision with CRF for Chinese Word Segmentation. 2009

Xue, Nianwen. Chinese Word Segmentation as Character Tagging. International Journal of Computational Linguistics and Chinese Language Processing, 8(1). 2003

Zhao, Hai, et al. "Effective Tag Set Selection in Chinese Word Segmentation via Conditional Random Field Modeling", Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation. 2006

Part-of-Speech Tagging:

Huihsin Tseng, Daniel Jurafsky, Christopher Manning. 2005. Discriminative Reordering with Chinese Grammatical Relations Features

Question Answering:

START Natural Language Question Answering System

Quora Question: How do I build a natural language question answering system?

Dan Jurafsky and James H. Martin, "Speech and Language Processing (3rd ed. draft)", Ch. 23: Question Answering, draft chapters in progress, August 29, 2019.

J. Prager, "Open-Domain Question-Answering", Information Retrieval, 2006.

Named Entity:

Mengqiu Wang and Christopher D. Manning. 2013. Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning. Transactions of ACL 2013

Mengqiu Wang, Wanxiang Che and Christopher D. Manning. 2013. Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition. ACL 2013

Mengqiu Wang, Wanxiang Che and Christopher D. Manning. 2013. Effective Bilingual Constraints for Semi-supervised Learning of Named Entity Recognizers. AAAI 2013

Wanxiang Che, Mengqiu Wang and Christopher D. Manning. 2013. Named Entity Recognition with Bilingual Constraints. NAACL 2013