Shawn
CS299 Proposal
Classification of Web Pages in Yioop with Active Learning
Abstract
This thesis will add to the Yioop search engine a general facility for automatically assigning "class" meta tags (e.g. "class:spam" or "class:resume") to web pages according to the output of a supervised learning algorithm trained on labeled examples. To minimize the burden on the administrator training the classifier, a small "seed" set of hand-labeled documents will be used to bootstrap a larger training corpus from unlabeled examples. The classifier built from the seed training set will assign tentative labels to unlabeled training documents drawn from a previous crawl, and these decisions will then be presented to the administrator for correction. The corrected examples will be added to the corpus, the classifier retrained, and the process repeated until the administrator decides to stop or there are no more training documents to label.
CS297 Results
In the Fall 2012 semester I accomplished the following:
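The labeling loop outlined in the abstract (seed classifier, tentative labels, administrator corrections, retraining) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Yioop's implementation: the proposal does not fix a classifier at this point, so a small multinomial naive Bayes model stands in for it, and the names `NaiveBayes`, `bootstrap`, `ask_admin`, and `batch_size` are hypothetical. The sketch takes pool documents in order and does not attempt to choose the most informative ones.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing.

    An assumed stand-in classifier; the proposal does not prescribe one.
    """

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def bootstrap(seed, pool, ask_admin, batch_size=2):
    """Grow a training corpus from a hand-labeled seed set.

    seed:      list of (text, label) pairs labeled by hand
    pool:      list of unlabeled texts (e.g. drawn from a previous crawl)
    ask_admin: hypothetical callback (text, tentative_label) -> corrected
               label, or None when the administrator decides to stop
    """
    corpus = list(seed)
    pool = list(pool)
    clf = NaiveBayes().fit(*zip(*corpus))
    while pool:
        batch, pool = pool[:batch_size], pool[batch_size:]
        stopped = False
        for text in batch:
            tentative = clf.predict(text)       # classifier's guess
            label = ask_admin(text, tentative)  # administrator confirms or corrects
            if label is None:
                stopped = True
                break
            corpus.append((text, label))
        clf = NaiveBayes().fit(*zip(*corpus))   # retrain on the enlarged corpus
        if stopped:
            break
    return clf, corpus


if __name__ == "__main__":
    seed = [("cheap pills buy now", "spam"),
            ("software engineer resume experience", "resume")]
    pool = ["buy cheap pills", "resume of an engineer", "experience buy now"]
    # Simulated administrator who accepts every tentative label.
    clf, corpus = bootstrap(seed, pool, lambda text, guess: guess)
    print(len(corpus), clf.predict("cheap pills"))
```

In a real deployment `ask_admin` would be backed by a user interface rather than a function call. Retraining from scratch after each batch keeps the sketch short; an incremental update would be a natural refinement.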
Schedule
Deliverables
The primary deliverables are an implementation of the classification system and a report, but each of these will be broken down into several sub-deliverables:
References
[Breiman96] Bagging predictors. Breiman, L. Machine Learning, 24(2):123-140. 1996.
[Büttcher10] Information Retrieval: Implementing and Evaluating Search Engines. Büttcher, S., Clarke, C., and Cormack, G. The MIT Press. 2010.
[Crammer02] On the algorithmic implementation of multiclass kernel-based vector machines. Crammer, K., and Singer, Y. Journal of Machine Learning Research, 2:265-292. 2002.
[Langley92] An analysis of Bayesian classifiers. Langley, P., Iba, W., and Thompson, K. In Proc. of the 10th National Conf. on Artificial Intelligence. 1992.
[Lee08] A cluster-based resampling method for pseudo-relevance feedback. Lee, K.S., Croft, W.B., and Allan, J. In Proc. of the 31st Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 235-242. 2008.
[McCallum98] Employing EM in pool-based active learning for text classification. McCallum, A., and Nigam, K. In Proc. of ICML-98, 15th International Conference on Machine Learning. 1998.
[Nigam00] Text classification from labeled and unlabeled documents using EM. Nigam, K., et al. Machine Learning, 39(2):103-134. 2000.
[Ramoni01] Robust Bayes classifiers. Ramoni, M., and Sebastiani, P. Artificial Intelligence, 125:209-226. 2001.
[Ruthven03] A survey on the use of relevance feedback for information access systems. Ruthven, I., and Lalmas, M. Knowledge Engineering Review, 18(2):95-145. 2003.
[Schapire00] BoosTexter: A boosting-based system for text categorization. Schapire, R., and Singer, Y. Machine Learning, 39(2):135-168. 2000.
[Sebastiani02] Machine learning in automated text categorization. Sebastiani, F. ACM Computing Surveys, 34(1):1-47. 2002.
[Tan06] Introduction to Data Mining. Tan, P.-N., Steinbach, M., and Kumar, V. Addison Wesley. 2006.
[Wang08] A study of methods for negative relevance feedback. Wang, X., Fang, H., and Zhai, C. In Proc. of the 31st Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 219-226. 2008.