CS297 Proposal
Improved Chinese Language Processing for an Open Source Search Engine
Xianghong "Forrest" Sun (xianghong.sun@sjsu.edu)
Advisor: Dr. Chris Pollett

Description:
Yioop is an open source search engine.
It supports many languages and can process and analyze text.
Yioop currently segments Chinese text using backward maximum matching.
However, backward maximum matching has a poorer hit rate than more recent algorithms.
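
As a rough illustration of the technique named above, the following is a minimal sketch of backward maximum matching. It is not Yioop's actual implementation: the function name, the `max_word_len` parameter, and the toy dictionary are all assumptions made for this example. The idea is to scan the text from right to left and, at each position, take the longest dictionary word that ends there, falling back to a single character when nothing matches.

```python
def backward_maximum_matching(text, dictionary, max_word_len=4):
    """Segment text right to left, preferring the longest dictionary match.

    `dictionary` is any container supporting `in` (a set is assumed here);
    `max_word_len` is a hypothetical cap on the candidate word length.
    """
    words = []
    end = len(text)
    while end > 0:
        # Start with the longest candidate ending at `end`, then shrink it
        # from the left until it is in the dictionary or is one character.
        start = max(0, end - max_word_len)
        while start < end - 1 and text[start:end] not in dictionary:
            start += 1
        # Either a dictionary word or a single unmatched character.
        words.append(text[start:end])
        end = start
    words.reverse()
    return words


# Toy dictionary, invented for this example only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(backward_maximum_matching("研究生命起源", toy_dictionary))
```

With this toy dictionary the sentence is split into 研究 / 生命 / 起源, whereas a left-to-right longest match would first take 研究生 and leave 命 stranded. Because the method relies only on dictionary lookups, it cannot resolve ambiguities by context or recognize words missing from the dictionary.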
The goal of my project is to implement better algorithms for analyzing and extracting Chinese words from text.

Schedule:
Deliverables:
The full project will be done when CS298 is completed. The following will be done by the end of CS297:
1. Implement a dictionary-based word suggestion system on Yioop.
2. Implement a Chinese word segmentation algorithm.
3. Implement a part-of-speech tagging system.
4. Implement a question answering system.
5. Final CS 297 Report.

References:

Conditional Random Fields:
John Lafferty, Andrew McCallum, and Fernando C.N. Pereira. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data". June 2001.
Charles Sutton and Andrew McCallum. "An Introduction to Conditional Random Fields". MIT Press, 2006.

Chinese Segmentation:
Sproat, Richard, Shih, Chilin, Gale, William, and Chang, Nancy. "A Stochastic Finite-State Word-Segmentation Algorithm for Chinese". Computational Linguistics, 22. 2002. doi:10.3115/981732.981742.
Peng, Fuchun. "Chinese Segmentation and New Word Detection using Conditional Random Fields". Computer Science Department Faculty Publication Series, 92. 2004.
Li, Shoushan and Huang, Chu-Ren. "Word Boundary Decision with CRF for Chinese Word Segmentation". 2009.
Xue, Nianwen. "Chinese Word Segmentation as Character Tagging". International Journal of Computational Linguistics and Chinese Language Processing, 8(1). 2003.
Zhao, Hai, et al. "Effective Tag Set Selection in Chinese Word Segmentation via Conditional Random Field Modeling". Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation. 2006.

Part-of-Speech Tagging:
Huihsin Tseng, Daniel Jurafsky, and Christopher Manning. "Discriminative Reordering with Chinese Grammatical Relations Features". 2005.

Question Answering:
START Natural Language Question Answering System.
Quora Question: "How do I build a natural language question answering system?"
Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft), Ch. 23: Question Answering. Draft chapters in progress, August 29, 2019.
J. Prager. "Open-Domain Question-Answering". Information Retrieval, 2006.

Named Entity:
Mengqiu Wang and Christopher D. Manning. "Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning". Transactions of ACL, 2013.
Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. "Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition". ACL 2013.
Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. "Effective Bilingual Constraints for Semi-supervised Learning of Named Entity Recognizers". AAAI 2013.
Wanxiang Che, Mengqiu Wang, and Christopher D. Manning. "Named Entity Recognition with Bilingual Constraints". NAACL 2013.