CS297 Proposal
User-Parameterized Classification of Web Pages in Yioop!
Shawn Tice
Advisor: Dr. Chris Pollett
Description: The Yioop! search engine can add meta tags to indexed
pages at crawl time so that a user can limit a particular query to pages
matching one or more meta tags. Presently these meta tags can be set by several
hard-coded and user-configurable rules, but such methods require the user to
know exactly what features to look for in a page's text or URL in order to
assign a particular tag. This project will add a general facility for assigning
"class" meta tags to web pages based on a supervised learning
algorithm trained on user-supplied examples.
In order to minimize the burden on the person training the classifier, a
small set of user-classified documents will be used to bootstrap a larger
training corpus from previously-crawled pages. This process will proceed
roughly as follows:
- The user uploads some representative set of web pages (the seed pages)
for each class.
- The tool identifies the top keywords for each class and performs a
query for those keywords on the existing index to get a set of candidate
pages for classification. The tool then classifies the candidates using
its current parameters, discarding any page that cannot be assigned to
some class with at least a threshold confidence. The threshold is
weighted by the size of the candidate set, so that a larger candidate
set demands higher confidence that a page belongs to some class.
- The tool presents the results of its initial classification to the user
as a list of pages (maybe with snippets and a link to the page), the class
that each page is supposed to belong to, and the confidence with which that
page belongs to the class. Each result has a "correct" and an
"incorrect" button, and the user clicks one or the other to give the
classifier feedback. When the user marks a page as "correct", that page
is added to the training corpus for the appropriate class.
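The size-weighted confidence threshold in the filtering step above could be sketched as follows. This is an illustrative Python sketch only (the eventual implementation would be in PHP); the logarithmic weighting, the base threshold of 0.6, and all names are assumptions, since the proposal deliberately leaves the exact form open.

```python
import math

def confidence_threshold(base, num_candidates):
    # One plausible weighting: the threshold rises logarithmically with
    # the number of candidates, capped below 1.0. The proposal does not
    # fix the functional form; this is a hypothetical choice.
    return min(0.99, base + 0.05 * math.log10(max(num_candidates, 1)))

def filter_candidates(candidates, base_threshold=0.6):
    # candidates: list of (page, predicted_class, confidence) triples.
    # Pages that do not clear the size-weighted threshold are discarded.
    threshold = confidence_threshold(base_threshold, len(candidates))
    return [c for c in candidates if c[2] >= threshold]
```

With a single candidate the threshold stays at the base value; as the candidate set grows, the bar for keeping a page climbs toward the cap, matching the intent that a larger set requires higher confidence.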
At any point after providing the seed corpus (and optionally improving it
via feedback) the user can initiate a new crawl or recrawl, and specify the
classifier to be used. The classifier assigns each page crawled a class and a
confidence score for that class as a meta tag.
Schedule:
I expect to add readings to this schedule as I
work through my initial references, especially with regard to keyword
extraction.
Week 1:  Sep. 03-10        | Discuss plan and prior work. Presentation on classification methods.
Week 2:  Sep. 10-17        | Background research. Reading: Ch. 10 in Büttcher10, Ch. 4 in Tan06, Sebastiani02. Demo of Naive Bayes classifier for section 2 and 3 man pages.
Week 3:  Sep. 17-24        | Background research. Reading: Ch. 11 in Büttcher10, Ch. 5 in Tan06.
Week 4:  Sep. 24 - Oct. 01 | Background research. Reading: Ch. 8 in Büttcher10, Ruthven03, Langley92. Presentation on results from Naive Bayes classifier and readings.
Week 5:  Oct. 01-08        | Background research. Reading: Lee08, Wang08, Ramoni01, Crammer02.
Week 6:  Oct. 08-15        | Deliverable 1: A report on the training process and potential algorithms (slides).
Week 7:  Oct. 15-22        | Progress on classifier. Reading: Schapire00, Breiman96.
Week 8:  Oct. 22-29        | Progress on classifier.
Week 9:  Oct. 29 - Nov. 05 | Progress on classifier.
Week 10: Nov. 05-12        | Deliverable 2: An implementation of a classifier for web pages.
Week 11: Nov. 12-19        | Progress on keyword extraction.
Week 12: Nov. 19-26        | Progress on keyword extraction.
Week 13: Nov. 26 - Dec. 03 | Progress on keyword extraction.
Week 14: Dec. 03-10        | Deliverable 3: An implementation of keyword extraction; rough draft of report.
Week 15: Dec. 10-17        | Deliverable 4: A report on progress made over the semester and on what work remains.
Deliverables:
The full project will be done when CS298 is completed. The following will
be done by the end of CS297:
- A report on the training process and potential algorithms. This report
will outline the use cases for training the classifier and running a crawl
or recrawl with an active classifier. It will also include a review of
relevant prior work in the areas of supervised text classification and
keyword extraction, and a basic implementation of a representative text
classification algorithm.
- An implementation of a classifier for web pages. The classifier will be
implemented as a PHP program independent of the Yioop! framework, and will
be trained on a large hand-annotated corpus. It should classify the web
pages of a test corpus (independent of the training corpus) into the
learned classes with accuracy approaching the current state of the
art.
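As a concrete reference point for this deliverable, the multinomial Naive Bayes classifier mentioned in the schedule can be sketched as follows. This is a minimal Python illustration (the deliverable itself would be PHP); the class name, method names, and add-one smoothing are representative choices, not the project's final design.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over bag-of-words features,
    with add-one (Laplace) smoothing. Illustrative sketch only."""

    def train(self, docs):
        # docs: list of (token_list, label) pairs.
        self.class_counts = Counter(label for _, label in docs)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in docs:
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total = sum(self.class_counts.values())

    def classify(self, tokens):
        # Returns (best_label, confidence), where confidence is the
        # posterior probability of the best label, normalized over
        # all labels -- the score a thresholding step could use.
        log_probs = {}
        v = len(self.vocab)
        for label, count in self.class_counts.items():
            lp = math.log(count / self.total)  # log prior
            denom = sum(self.word_counts[label].values()) + v
            for tok in tokens:
                lp += math.log((self.word_counts[label][tok] + 1) / denom)
            log_probs[label] = lp
        best = max(log_probs, key=log_probs.get)
        norm = sum(math.exp(lp - log_probs[best]) for lp in log_probs.values())
        return best, 1.0 / norm
```

Returning a normalized posterior alongside the label matches the proposal's need for a per-page confidence score that can feed both the candidate-filtering step and the crawl-time meta tag.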
- An implementation of keyword extraction. The keyword extraction
algorithm will be implemented as a PHP program independent of the Yioop!
framework, and should extract the top N keywords from each
document in a hand-annotated test corpus with accuracy approaching the
current state of the art.
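One standard baseline for keyword extraction is TF-IDF scoring over the corpus, which could be sketched as below. This is a hedged Python illustration (the deliverable itself would be PHP); the function name and the particular TF and IDF formulas are assumptions, since the proposal leaves the extraction algorithm to be determined by the background reading.

```python
import math
from collections import Counter

def top_keywords(docs, n=5):
    # docs: list of token lists, one per document.
    # Returns, for each document, its n highest-scoring terms by
    # TF-IDF: term frequency within the document times the log of
    # inverse document frequency across the corpus.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency counts each doc once
    num_docs = len(docs)
    results = []
    for tokens in docs:
        tf = Counter(tokens)
        scores = {t: (count / len(tokens)) * math.log(num_docs / df[t])
                  for t, count in tf.items()}
        results.append(sorted(scores, key=scores.get, reverse=True)[:n])
    return results
```

Terms that appear in every document get an IDF of zero, so corpus-wide stopwords fall out of the ranking without an explicit stopword list.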
- A report on progress made over the semester and on what work
remains. The report will review the reading that I've done, discuss the
performance of the classifier and keyword extraction tools that I've
implemented, and lay out a plan for completing the project in the next
semester.
References:
[Breiman96] Bagging predictors. Breiman, L. Machine Learning,
24(2):123-140. 1996.
[Büttcher10] Information Retrieval: Implementing and
Evaluating Search Engines. Büttcher, Stefan, Clarke, Charles, and Cormack,
Gordon. The MIT Press. 2010.
[Crammer02] On the algorithmic implementation of multiclass
kernel-based vector machines. Crammer, K., and Singer, Y. Journal of Machine
Learning Research, 2:265-292. 2002.
[Langley92] An analysis of Bayesian classifiers. Langley, P.,
Iba, W., Thompson, K. In Proc. of the 10th National Conf. on Artificial
Intelligence, 1992.
[Lee08] A cluster-based resampling method for pseudo-relevance
feedback. Lee, K.S., Croft, W.B., and Allan, J. In Proc. of the 31st Ann.
Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval,
pages 235-242. 2008.
[Ramoni01] Robust Bayes classifiers. Ramoni, M., Sebastiani, P.
Artificial Intelligence, 125: 209-226. 2001.
[Ruthven03] A survey on the use of relevance feedback for
information access systems. Ruthven, I., and Lalmas, M. Knowledge Engineering
Review, 18(2):95-145. 2003.
[Schapire00] BoosTexter: A boosting-based system for text
categorization. Schapire, R., and Singer, Y. Machine Learning, 39(2):135-168.
2000.
[Sebastiani02] Machine learning in automated text
categorization. Sebastiani, F. ACM Computing Surveys, 34(1):1-47. 2002.
[Tan06] Introduction to Data Mining. Tan, Pang-Ning, Steinbach,
M., Kumar, V. Addison Wesley. 2006.
[Wang08] A study of methods for negative relevance feedback.
Wang, X., Fang, H., and Zhai, C. In Proc. of the 31st Ann. Intl. ACM SIGIR
Conf. on Research and Development in Information Retrieval, pages 219-226.
2008.