CS298 Proposal
Improved Chinese Language Processing for an Open Source Search Engine
Xianghong "Forrest" Sun (xianghong.sun@sjsu.edu)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Robert Chun, Dr. Mike Wu
Abstract:
Yioop is an open source search engine.
It supports many languages and can process and analyze text in each of them.
Yioop currently has some support for Chinese language processing, such as Chinese text segmentation.
However, this support is limited in both scope and quality.
In this project, I will implement a better algorithm for segmenting Chinese text, an algorithm for part-of-speech (POS) tagging, and a Chinese question answering system.
CS297 Results
- Implemented a dictionary-based word suggestion system in Yioop.
- Implemented a Chinese word segmentation algorithm.
- Partially implemented a Chinese part-of-speech tagging system.
- Partially implemented a Chinese question answering system.
Proposed Schedule
Week 1 (Jan 23 - Jan 28): First meeting. Decide how to enhance the project and discuss improvements to the word segmentation algorithm.
Week 2 (Jan 29 - Feb 5): Finish the final version of the Chinese segmentation algorithm.
Weeks 3-4 (Feb 5 - Feb 18): Enhance the POS tagging system. Work on the POS tagging formula and its implementation.
Weeks 5-7 (Feb 19 - Mar 10): Start the question answering system.
Weeks 8-11 (Mar 11 - Apr 07): Work on the entity recognition system, depending on the results of the previous weeks.
Weeks 12-16 (Apr 08 - May 05): Review previous work and complete the project report and slides.
Key Deliverables:
- Software
  - Enhance the Chinese word segmentation algorithm with memory optimization
  - Enhance the Chinese POS tagging algorithm with trained weights
  - Implement a Chinese question answering system
  - Implement a Chinese entity recognition system
- Report
  - CS 298 Report
  - CS 298 Presentation
Innovations and Challenges
- Natural language processing combines many different areas, such as linguistics, computer science, information engineering, and artificial intelligence.
- One of the most challenging parts is implementing machine learning algorithms, because the project must be implemented without external libraries.
Most existing work is based on very complicated machine learning models, so I have to judge which models to use.
A chosen model must be feasible to implement from scratch and still produce good results.
- Since the project is open source, there are some constraints. One of them is limited memory usage.
The source code is written in PHP, and PHP arrays consume a lot of memory because PHP is a dynamic language.
I implemented a cache so that not everything has to be loaded into memory at once.
- Different languages have different grammars. Chinese sentences are much harder to process than English ones because their structure is less rigid.
Several Chinese sentences can be concatenated with only commas between them, which makes them harder to separate and makes the real subject harder to find.
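
The caching idea from the memory constraint above can be illustrated with a small least-recently-used (LRU) cache. This is only a sketch under stated assumptions, not the Yioop implementation: it is written in Python for brevity rather than PHP, the class name `LRUCache`, the `loader` callback, and the sample word frequencies are all hypothetical, and the eviction policy actually used in the project may differ.

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most `capacity` dictionary entries in memory (hypothetical helper)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader            # assumed callback that reads one entry from disk
        self.entries = OrderedDict()    # insertion order doubles as recency order
        self.misses = 0                 # number of times the loader had to be called

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.loader(key)              # slow path: fetch from backing store
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value


# Stand-in for an on-disk dictionary of word frequencies (made-up values).
disk = {"中国": 120, "搜索": 45, "引擎": 30}
cache = LRUCache(capacity=2, loader=disk.get)
for word in ["中国", "搜索", "中国", "引擎"]:
    cache.get(word)
print(list(cache.entries))  # words currently held in memory
```

Only `capacity` entries stay resident at any time, so memory use is bounded regardless of how large the backing dictionary is; the cost is an extra load whenever an evicted entry is requested again.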
References:
Richard Sproat, Chilin Shih, William Gale and Nancy Chang. 2002. A Stochastic Finite-State Word-Segmentation Algorithm for Chinese. Computational Linguistics, 22. doi:10.3115/981732.981742
Nianwen Xue. 2003. Chinese Word Segmentation as Character Tagging. International Journal of Computational Linguistics and Chinese Language Processing, 8(1).
Huihsin Tseng, Daniel Jurafsky and Christopher Manning. 2005. Discriminative Reordering with Chinese Grammatical Relations Features.
Dan Jurafsky and James H. Martin. 2019. Speech and Language Processing (3rd ed. draft), Chapter 23: Question Answering. Draft chapters in progress, August 29, 2019.
J. Prager. 2006. Open-Domain Question-Answering. Information Retrieval.
Mengqiu Wang and Christopher D. Manning. 2013. Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning. Transactions of ACL 2013.
Mengqiu Wang, Wanxiang Che and Christopher D. Manning. 2013. Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition. ACL 2013.
Mengqiu Wang, Wanxiang Che and Christopher D. Manning. 2013. Effective Bilingual Constraints for Semi-supervised Learning of Named Entity Recognizers. AAAI 2013.
Wanxiang Che, Mengqiu Wang and Christopher D. Manning. 2013. Named Entity Recognition with Bilingual Constraints. NAACL 2013.