CS298 Proposal

Improved Chinese Language Processing for an Open Source Search Engine

Xianghong "Forrest" Sun (xianghong.sun@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun, Dr. Mike Wu

Abstract:

Yioop is an open source search engine. It supports many languages and can process and analyze text in each of them. Yioop currently has some support for Chinese language processing, such as Chinese text segmentation, but this support is limited and the quality of its results leaves room for improvement. In this project, I will implement a better algorithm for segmenting Chinese text, an algorithm for part-of-speech (POS) tagging, and a Chinese question answering system.

CS297 Results

Implemented a dictionary-based word suggestion system in Yioop.
Implemented a Chinese word segmentation algorithm.
Partially implemented a Chinese part-of-speech tagging system.
Partially implemented a Chinese question answering system.

Proposed Schedule

Week 1: Jan 23 - Jan 28	First meeting to decide how to enhance the project. Discuss enhancements to the word segmentation algorithm.
Week 2: Jan 29 - Feb 4	Finish the final version of the Chinese segmentation algorithm.
Week 3-4: Feb 5 - Feb 18	Enhance the POS tagging system. Work on the POS tagging formula and its implementation.
Week 5-7: Feb 19 - Mar 10	Start the Question Answering System.
Week 8-11: Mar 11 - Apr 7	Work on the Entity Recognition System, depending on the previous results.
Week 12-16: Apr 8 - May 5	Review previous work and complete the project report and slides.

Key Deliverables:

Software
- Enhance the Chinese word segmentation algorithm with memory optimization
- Enhance the Chinese POS tagging algorithm with trained weights
- Implement a Chinese Question Answering System
- Implement a Chinese Entity Recognition System
Report
- CS 298 Report
- CS 298 Presentation

Innovations and Challenges

In general, Natural Language Processing combines many different areas, such as linguistics, computer science, information engineering, and artificial intelligence.
One of the most challenging parts is implementing the machine learning algorithms. The project must be implemented without external libraries, but most of the existing work is based on very complicated machine learning models. I therefore have to judge which models to use: a model must be feasible to implement from scratch and still produce good results.
Because the project is part of an open source search engine, there are some constraints. One of them is limited memory usage. The source code is written in PHP, and PHP arrays consume a lot of memory because PHP is a dynamically typed language. I implemented a cache so that not everything has to be loaded into memory at once.
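The caching idea can be illustrated with a minimal sketch. This is not Yioop's code (Yioop is written in PHP); it is a Python illustration that assumes a least-recently-used eviction policy, and the names BoundedCache, loader, and capacity are hypothetical.

```python
from collections import OrderedDict


class BoundedCache:
    """Hypothetical LRU cache: keeps at most `capacity` entries in memory."""

    def __init__(self, loader, capacity=1000):
        # `loader` is an assumed callback that fetches one entry
        # (for example, a dictionary record) from disk on demand.
        self.loader = loader
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(key)
            return self.entries[key]
        # Cache miss: load only this entry instead of the whole data set.
        value = self.loader(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry to bound memory use.
            self.entries.popitem(last=False)
        return value
```

With such a structure, memory use is bounded by the capacity rather than by the size of the full dictionary, at the cost of reloading entries that have been evicted.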
Different languages have different grammars. Chinese sentences are much harder to process than English ones because their structure is less rigid. Several Chinese sentences can be concatenated with only commas between them, which makes them harder to separate and makes it harder to find the actual subject.
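A small sketch shows why the comma problem is hard. It assumes a naive approach that splits on full-width punctuation only; the function name split_clauses and the example sentence are hypothetical and are not taken from the project.

```python
import re

# Assumed punctuation sets: sentence-final marks and clause separators.
SENTENCE_END = "。！？"
CLAUSE_SEP = "，；"


def split_clauses(text):
    """Split text into sentences, then split each sentence into clauses."""
    sentences = [s for s in re.split(f"[{SENTENCE_END}]", text) if s.strip()]
    return [
        [c.strip() for c in re.split(f"[{CLAUSE_SEP}]", s) if c.strip()]
        for s in sentences
    ]


# "I went to the library today, borrowed three books, will go again tomorrow."
clauses = split_clauses("我今天去了图书馆，借了三本书，明天还要再去。")
```

Here the single sentence yields three clauses, but the subject 我 ("I") appears only in the first one. Splitting on punctuation alone cannot tell whether the later clauses are independent sentences or share that subject, so further analysis is needed.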

References:

Sproat, Richard & Shih, Chilin & Gale, William & Chang, Nancy. (2002). A Stochastic Finite-State Word-Segmentation Algorithm For Chinese. Computational Linguistics. 22. 10.3115/981732.981742.

Xue, Nianwen. Chinese Word Segmentation as Character Tagging. International Journal of Computational Linguistics and Chinese Language Processing, 8(1). 2003

Huihsin Tseng, Daniel Jurafsky, Christopher Manning. 2005. Discriminative Reordering with Chinese Grammatical Relations Features

Dan Jurafsky and James H. Martin. 2019. Speech and Language Processing (3rd ed. draft), Chapter 23: Question Answering. Draft chapters in progress, August 29, 2019.

J. Prager. 2006. Open-Domain Question-Answering. Information Retrieval.

Mengqiu Wang and Christopher D. Manning. 2013. Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning. Transactions of ACL 2013

Mengqiu Wang, Wanxiang Che and Christopher D. Manning. 2013. Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition. ACL 2013

Mengqiu Wang, Wanxiang Che and Christopher D. Manning. 2013. Effective Bilingual Constraints for Semi-supervised Learning of Named Entity Recognizers. AAAI 2013

Wanxiang Che, Mengqiu Wang and Christopher D. Manning. 2013. Named Entity Recognition with Bilingual Constraints. NAACL 2013