CS299 Proposal
Classification of Web Pages in Yioop with Active Learning
Student   | Shawn Tice (shawn.cameron.bird@gmail.com)
Advisor   | Dr. Chris Pollett (chris@pollett.org)
Committee | Dr. Mark Stamp (stamp@cs.sjsu.edu)
          | Dr. Cay Horstmann (cay@horstmann.com)
Abstract
This thesis will add to the Yioop search engine a general facility for
automatically assigning "class" meta tags (e.g.
"class:spam" or "class:resume") to web pages according to
the output of a supervised learning algorithm trained on labeled examples. In
order to minimize the burden on the administrator training the classifier, a
small "seed" set of hand-labeled documents will be used to bootstrap
a larger training corpus from unlabeled examples. The classifier built from the
seed training set will be used to assign a tentative label to unlabeled
training documents drawn from a previous crawl, and these decisions will then
be presented to the administrator for correction. The corrected examples will
be added to the corpus, the classifier retrained, and the process repeated
until the administrator decides to stop or there are no more training documents
to label.
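At its core this is a simple train/label/correct loop. The sketch below illustrates only the intended control flow; it is written in Python for brevity (Yioop itself is written in PHP), and the train, classify, and ask_administrator helpers are hypothetical stand-ins for the real components:

    # Illustrative sketch of the proposed active learning loop. The helper
    # functions passed in are hypothetical stand-ins, not part of Yioop.
    def active_learning_loop(seed_docs, seed_labels, unlabeled_docs,
                             train, classify, ask_administrator, batch_size=10):
        """Bootstrap a larger labeled corpus from a small hand-labeled seed set.

        train(docs, labels)      -> classifier
        classify(clf, doc)       -> tentative label for doc
        ask_administrator(pairs) -> corrected (doc, label) pairs, or None to stop
        """
        corpus_docs, corpus_labels = list(seed_docs), list(seed_labels)
        classifier = train(corpus_docs, corpus_labels)

        while unlabeled_docs:
            # Tentatively label the next batch of documents from a previous crawl.
            take = min(batch_size, len(unlabeled_docs))
            batch = [unlabeled_docs.pop() for _ in range(take)]
            tentative = [(doc, classify(classifier, doc)) for doc in batch]

            # Present the tentative labels to the administrator for correction.
            corrected = ask_administrator(tentative)
            if corrected is None:  # the administrator chose to stop
                break

            # Fold the corrected examples into the corpus and retrain.
            for doc, label in corrected:
                corpus_docs.append(doc)
                corpus_labels.append(label)
            classifier = train(corpus_docs, corpus_labels)

        return classifier

In the actual system the unlabeled pool would be drawn from a previous crawl and the corrections would arrive through the administrative interface described under Deliverables, but the loop structure is the same.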
CS297 Results
In the Fall 2012 semester I accomplished the following:
- Developed a plan for how web page classification will work in
Yioop
- Explored the literature on text classification
- Implemented and experimented with Naive Bayes and Logistic Regression classifiers (a Naive Bayes sketch follows this list)
- Implemented and experimented with feature selection
- Wrote a report detailing my progress and findings
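As a rough reminder of what the Naive Bayes prototype computes, here is a minimal multinomial Naive Bayes classifier with add-one smoothing. This Python version is illustrative only; the CS297 experiments were written against Yioop's PHP codebase, and the class and method names below are assumptions:

    import math
    from collections import Counter, defaultdict

    class MultinomialNaiveBayes:
        """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.

        Documents are represented as lists of term strings."""

        def fit(self, docs, labels):
            self.class_doc_counts = Counter(labels)    # class -> number of docs
            self.term_counts = defaultdict(Counter)    # class -> term -> count
            self.vocab = set()
            for terms, label in zip(docs, labels):
                self.term_counts[label].update(terms)
                self.vocab.update(terms)
            self.total_docs = len(docs)
            return self

        def predict(self, terms):
            best_label, best_score = None, float("-inf")
            for label, doc_count in self.class_doc_counts.items():
                # log P(class) plus the sum of smoothed log P(term | class)
                score = math.log(doc_count / self.total_docs)
                denom = sum(self.term_counts[label].values()) + len(self.vocab)
                for term in terms:
                    score += math.log((self.term_counts[label][term] + 1) / denom)
                if score > best_score:
                    best_label, best_score = label, score
            return best_label

For example, MultinomialNaiveBayes().fit([["cheap", "pills"], ["meeting", "agenda"]], ["spam", "ham"]).predict(["cheap", "offer"]) returns whichever class the smoothed term statistics favor (here "spam").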
Schedule
Week |
Date |
Plan |
01 | Jan 23–29 |
Complete proposal and work on detailed plan for Deliverable
1 |
02 | Jan 30 to Feb 05 |
Deliverable 1—detailed plan; build test corpus from web
data |
03 | Feb 06–12 |
Add new activity tab; implement basic
create/update/delete |
04 | Feb 13–19 |
Implement crawl selection and iteration over documents |
05 | Feb 20–26 |
Deliverable 2—administrative interface |
06 | Feb 27 to Mar 05 |
Work on incorporating classifier framework; start on
report |
07–08 | Mar 06–19 |
Work on active learning framework incorporating user
feedback |
09 | Mar 20–26 |
Deliverable 3—complete classification system; draft
report for committee |
10 | Mar 27 to Apr 02 |
Revise draft and submit to GSO |
11–18 | Apr 03 to May 17 |
Deliverable 4—draft revised to comply with GSO
corrections; prepare for defense and polish classification system |
Deliverables
The primary deliverables are an implementation of the classification system
and a report, but each of these will be broken down into several
sub-deliverables:
- A detailed plan outlining the proposed changes to Yioop. The plan will
include use cases for the major features, a high-level discussion of
the algorithms to be used at each step, and a description of the Yioop
components that will need to be modified.
- Additions to Yioop's web-based administrative interface
  - Create/update/delete classifiers
  - Select previous crawls to train on, and iterate over documents
  - Display (fake) classification results, and elicit feedback
- Integration of the classifier framework into Yioop, providing full functionality. The final product will include:
  - The user interface
  - Implementations of Naive Bayes and Logistic Regression classifiers
  - Implementations of several feature selection and weighting algorithms (see the chi-square sketch after this list)
  - Implementation of an active learning framework which incorporates user feedback to improve classifier accuracy
- Final report
  - Draft of report for committee members
  - Revised draft to be submitted to the graduate studies office for review
  - Final draft to be submitted for binding
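As a concrete illustration of the feature selection deliverable above, the sketch below scores each term by the chi-square statistic of its association with a target class. The function name and the set-of-terms document representation are assumptions made for the example, not Yioop code:

    from collections import Counter

    def chi_square_scores(docs, labels, target_class):
        """Score terms by their chi-square association with target_class.

        docs   : list of sets of terms, one set per document
        labels : list of class labels, parallel to docs
        Returns a dict mapping term -> chi-square score; higher scores mean the
        term's presence is more strongly associated with the target class."""
        n = len(docs)
        in_class = [label == target_class for label in labels]
        n_pos = sum(in_class)
        n_neg = n - n_pos

        # For every term, count how many positive/negative documents contain it.
        term_pos, term_neg = Counter(), Counter()
        for terms, positive in zip(docs, in_class):
            (term_pos if positive else term_neg).update(terms)

        scores = {}
        for term in set(term_pos) | set(term_neg):
            a = term_pos[term]   # in class, contains term
            b = term_neg[term]   # not in class, contains term
            c = n_pos - a        # in class, lacks term
            d = n_neg - b        # not in class, lacks term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            scores[term] = 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom
        return scores

Keeping only the highest-scoring terms shrinks the feature space that the classifiers must handle.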
References
[Breiman96] Bagging predictors. Breiman, L. Machine Learning, 24(2):123-140. 1996.
[Büttcher10] Information Retrieval: Implementing and Evaluating Search Engines. Büttcher, S., Clarke, C., and Cormack, G. The MIT Press. 2010.
[Crammer02] On the algorithmic implementation of multiclass kernel-based vector machines. Crammer, K., and Singer, Y. Journal of Machine Learning Research, 2:265-292. 2002.
[Langley92] An analysis of Bayesian classifiers. Langley, P., Iba, W., and Thompson, K. In Proc. of the 10th National Conf. on Artificial Intelligence. 1992.
[Lee08] A cluster-based resampling method for pseudo-relevance feedback. Lee, K.S., Croft, W.B., and Allan, J. In Proc. of the 31st Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 235-242. 2008.
[McCallum98] Employing EM in pool-based active learning for text classification. McCallum, A., and Nigam, K. In Proc. of ICML-98, 15th International Conference on Machine Learning. 1998.
[Nigam00] Text classification from labeled and unlabeled documents using EM. Nigam, K., et al. Machine Learning, 39(2):103-134. 2000.
[Ramoni01] Robust Bayes classifiers. Ramoni, M., and Sebastiani, P. Artificial Intelligence, 125:209-226. 2001.
[Ruthven03] A survey on the use of relevance feedback for information access systems. Ruthven, I., and Lalmas, M. Knowledge Engineering Review, 18(2):95-145. 2003.
[Schapire00] BoosTexter: A boosting-based system for text categorization. Schapire, R., and Singer, Y. Machine Learning, 39(2):135-168. 2000.
[Sebastiani02] Machine learning in automated text categorization. Sebastiani, F. ACM Computing Surveys, 34(1):1-47. 2002.
[Tan06] Introduction to Data Mining. Tan, P.-N., Steinbach, M., and Kumar, V. Addison Wesley. 2006.
[Wang08] A study of methods for negative relevance feedback. Wang, X., Fang, H., and Zhai, C. In Proc. of the 31st Ann. Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 219-226. 2008.