CS299 Proposal

Classification of Web Pages in Yioop with Active Learning

Student:   Shawn Tice (shawn.cameron.bird@gmail.com)
Advisor:   Dr. Chris Pollett (chris@pollett.org)
Committee: Dr. Mark Stamp (stamp@cs.sjsu.edu)
           Dr. Cay Horstmann (cay@horstmann.com)

Abstract

This thesis will add to the Yioop search engine a general facility for automatically assigning "class" meta tags (e.g. "class:spam" or "class:resume") to web pages according to the output of a supervised learning algorithm trained on labeled examples. In order to minimize the burden on the administrator training the classifier, a small "seed" set of hand-labeled documents will be used to bootstrap a larger training corpus from unlabeled examples. The classifier built from the seed training set will be used to assign a tentative label to unlabeled training documents drawn from a previous crawl, and these decisions will then be presented to the administrator for correction. The corrected examples will be added to the corpus, the classifier retrained, and the process repeated until the administrator decides to stop or there are no more training documents to label.
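The bootstrapping loop described above can be sketched in Python. The function names (train, classify, request_correction) are illustrative placeholders standing in for Yioop components, not actual Yioop APIs:

```python
def active_learning_loop(seed, unlabeled, train, classify, request_correction):
    """Grow a training corpus from a hand-labeled seed set.

    seed: list of (document, label) pairs labeled by the administrator.
    unlabeled: iterable of documents drawn from a previous crawl.
    train: builds a classifier from a list of labeled examples.
    classify: applies a classifier to one document, returning a tentative label.
    request_correction: shows the tentative label to the administrator and
        returns the corrected label, or None if the administrator stops.
    """
    corpus = list(seed)
    model = train(corpus)
    for doc in unlabeled:
        tentative = classify(model, doc)
        corrected = request_correction(doc, tentative)
        if corrected is None:          # administrator chose to stop early
            break
        corpus.append((doc, corrected))
        model = train(corpus)          # retrain on the grown corpus
    return model, corpus
```

The loop terminates either when the administrator declines to correct another example or when the pool of unlabeled crawl documents is exhausted, matching the stopping conditions in the abstract.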

CS297 Results

In the Fall 2012 semester I accomplished the following:

  • Developed a plan for how web page classification will work in Yioop
  • Explored the literature on text classification
  • Implemented and experimented with Naive Bayes and Logistic Regression classifiers
  • Implemented and experimented with feature selection
  • Wrote a report detailing my progress and findings
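As one illustration of the kind of feature selection experimented with, information gain ranks terms by how much knowing a term's presence reduces uncertainty about a document's class. A simplified sketch (not the CS297 deliverable code) over token-set documents:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Reduction in label entropy from splitting on presence of `term`.

    docs: list of token sets; labels: parallel list of class labels.
    """
    with_term = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (with_term, without) if part)
    return entropy(labels) - remainder
```

Terms with the highest gain are retained as features; a term that perfectly separates the classes scores the full entropy of the label distribution.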

Schedule

Week   Date              Plan
01     Jan 23–29         Complete proposal and work on detailed plan for Deliverable 1
02     Jan 30–Feb 05     Deliverable 1: detailed plan; build test corpus from web data
03     Feb 06–12         Add new activity tab; implement basic create/update/delete
04     Feb 13–19         Implement crawl selection and iteration over documents
05     Feb 20–26         Deliverable 2: administrative interface
06     Feb 27–Mar 05     Work on incorporating classifier framework; start on report
07–08  Mar 06–19         Work on active learning framework incorporating user feedback
09     Mar 20–26         Deliverable 3: complete classification system; draft report for committee
10     Mar 27–Apr 02     Revise draft and submit to GSO
11–18  Apr 03–May 17     Deliverable 4: draft revised to comply with GSO corrections; prepare for defense and polish classification system

Deliverables

The primary deliverables are an implementation of the classification system and a report, but each of these will be broken down into several sub-deliverables:

  1. A detailed plan outlining the proposed changes to Yioop. The plan will include use cases for the major features, a high-level discussion of the algorithms to be used at each step, and a description of the Yioop components that will need to be modified.
  2. Additions to Yioop's web-based administrative interface
    1. Create/update/delete classifiers
    2. Select previous crawls to train on, and iterate over documents
    3. Display (fake) classification results, and elicit feedback
  3. Integration of the classifier framework into Yioop, providing full functionality. The final product will include:
    1. The user interface
    2. Implementations of Naive Bayes and Logistic Regression classifiers
    3. Implementations of several feature selection and weighting algorithms
    4. Implementation of an active learning framework which incorporates user feedback to improve classifier accuracy
  4. Final report
    1. Draft of report for committee members
    2. Revised draft to be submitted to the graduate studies office for review
    3. Final draft to be submitted for binding
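As a rough illustration of the Naive Bayes component named in Deliverable 3, a multinomial Naive Bayes classifier with add-one (Laplace) smoothing can be written compactly. This is a sketch of the technique, not the planned Yioop implementation:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: parallel list of class labels
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)   # class -> term -> count
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        n = sum(self.class_counts.values())
        v = len(self.vocab)
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n)           # log prior P(class)
            total = sum(self.term_counts[label].values())
            for t in tokens:
                tf = self.term_counts[label][t]
                # smoothed log likelihood P(term | class)
                score += math.log((tf + 1) / (total + v))
            if score > best_score:
                best, best_score = label, score
        return best
```

Working in log space avoids floating-point underflow when multiplying many small per-term probabilities, and the add-one smoothing keeps unseen terms from zeroing out a class's score.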

References

[Breiman96] Bagging predictors. Breiman, L. Machine Learning, 24(2):123-140. 1996.

[Büttcher10] Information Retrieval: Implementing and Evaluating Search Engines. Büttcher, Stefan, Clarke, Charles, and Cormack, Gordon. The MIT Press. 2010.

[Crammer02] On the algorithmic implementation of multiclass kernel-based vector machines. Crammer, K., and Singer, Y. Journal of Machine Learning Research, 2:265-292. 2002.

[Langley92] An analysis of Bayesian classifiers. Langley, P., Iba, W., Thompson, K. In Proc. of the 10th National Conf. on Artificial Intelligence, 1992.

[Lee08] A cluster-based resampling method for pseudo-relevance feedback. Lee, K.S., Croft, W.B., and Allan, J. In Proc. of the 31st Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 235-242. 2008.

[McCallum98] Employing EM in pool-based active learning for text classification. McCallum, A., Nigam, K. In Proc. of ICML-98, 15th International Conference on Machine Learning. 1998.

[Nigam00] Text classification from labeled and unlabeled documents using EM. Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. Machine Learning, 39(2):103-134. 2000.

[Ramoni01] Robust Bayes classifiers. Ramoni, M., Sebastiani, P. Artificial Intelligence, 125: 209-226. 2001.

[Ruthven03] A survey on the use of relevance feedback for information access systems. Ruthven, I., and Lalmas, M. Knowledge Engineering Review, 18(2):95-145. 2003.

[Schapire00] BoosTexter: A boosting-based system for text categorization. Schapire, R., and Singer, Y. Machine Learning, 39(2):135-168. 2000.

[Sebastiani02] Machine learning in automated text categorization. Sebastiani, F. ACM Computing Surveys, 34(1):1-47. 2002.

[Tan06] Introduction to Data Mining. Tan, Pang-Ning, Steinbach, M., Kumar, V. Addison Wesley. 2006.

[Wang08] A study of methods for negative relevance feedback. Wang, X., Fang, H., and Zhai, C. In Proc. of the 31st Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 219-226. 2008.