CS297 Proposal
User-Parameterized Classification of Web Pages in Yioop!
Shawn Tice
Advisor: Dr. Chris Pollett
Description: The Yioop! search engine can add meta tags to indexed
pages at crawl time so that a user can limit a particular query to pages
matching one or more meta tags. Presently these meta tags can be set by several
hard-coded and user-configurable rules, but such methods require the user to
know exactly what features to look for in a page's text or URL in order to
assign a particular tag. This project will add a general facility for assigning
"class" meta tags to web pages based on a supervised learning
algorithm trained on user-supplied examples.
In order to minimize the burden on the person training the classifier, a
small set of user-classified documents will be used to bootstrap a larger
training corpus from previously-crawled pages. This process will proceed
roughly as follows:
- The user uploads some representative set of web pages (the seed pages)
for each class.
- The tool identifies the top keywords for each class and performs a
query for those keywords on the existing index to get a set of candidate
pages for classification. The tool then classifies the candidates using
its current parameters, discarding any page that cannot be assigned to
some class with at least a threshold confidence. The threshold is
weighted by the size of the candidate set, so that a larger candidate
set demands higher confidence that a page belongs to some class.
- The tool presents the results of its initial classification to the user
as a list of pages (maybe with snippets and a link to the page), the class
that each page is supposed to belong to, and the confidence with which that
page belongs to the class. Each result has a "correct" and an
"incorrect" button, and the user clicks one or the other to give the
classifier feedback. When the user marks a page as "correct", that page
is added to the training corpus for the appropriate class.
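The size-weighted confidence threshold in the filtering step above could be sketched as follows. This is an illustrative Python sketch only (the eventual implementation would be in PHP); the logarithmic weighting, the base threshold of 0.6, and all names are assumptions, since the proposal deliberately leaves the exact form open.

```python
import math

def confidence_threshold(base, num_candidates):
    # One plausible weighting: the threshold rises logarithmically with
    # the number of candidates, capped below 1.0. The proposal does not
    # fix the functional form; this is a hypothetical choice.
    return min(0.99, base + 0.05 * math.log10(max(num_candidates, 1)))

def filter_candidates(candidates, base_threshold=0.6):
    # candidates: list of (page, predicted_class, confidence) triples.
    # Pages that do not clear the size-weighted threshold are discarded.
    threshold = confidence_threshold(base_threshold, len(candidates))
    return [c for c in candidates if c[2] >= threshold]
```

With a single candidate the threshold stays at the base value; as the candidate set grows, the bar for keeping a page climbs toward the cap, matching the intent that a larger set requires higher confidence.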
At any point after providing the seed corpus (and optionally improving it
via feedback) the user can initiate a new crawl or recrawl, and specify the
classifier to be used. The classifier assigns each page crawled a class and a
confidence score for that class as a meta tag.
Schedule:
I expect to add readings to this schedule as I
work through my initial references, especially with regard to keyword
extraction.
Week 1:  Sep. 03-10        | Discuss plan and prior work. Presentation on classification methods.
Week 2:  Sep. 10-17        | Background research. Reading: Ch. 10 in Büttcher10, Ch. 4 in Tan06, Sebastiani02. Demo of Naive Bayes classifier for section 2 and 3 man pages.
Week 3:  Sep. 17-24        | Background research. Reading: Ch. 11 in Büttcher10, Ch. 5 in Tan06.
Week 4:  Sep. 24 - Oct. 01 | Background research. Reading: Ch. 8 in Büttcher10, Ruthven03, Langley92. Presentation on results from Naive Bayes classifier and readings.
Week 5:  Oct. 01-08        | Background research. Reading: Lee08, Wang08, Ramoni01, Crammer02.
Week 6:  Oct. 08-15        | Deliverable 1: A report on the training process and potential algorithms (slides).
Week 7:  Oct. 15-22        | Progress on classifier. Reading: Schapire00, Breiman96.
Week 8:  Oct. 22-29        | Progress on classifier.
Week 9:  Oct. 29 - Nov. 05 | Progress on classifier.
Week 10: Nov. 05-12        | Deliverable 2: An implementation of a classifier for web pages.
Week 11: Nov. 12-19        | Progress on keyword extraction.
Week 12: Nov. 19-26        | Progress on keyword extraction.
Week 13: Nov. 26 - Dec. 03 | Progress on keyword extraction.
Week 14: Dec. 03-10        | Deliverable 3: An implementation of keyword extraction; rough draft of report.
Week 15: Dec. 10-17        | Deliverable 4: A report on progress made over the semester and on what work remains.
Deliverables:
The full project will be done when CS298 is completed. The following will
be done by the end of CS297:
- A report on the training process and potential algorithms. This report
will outline the use cases for training the classifier and running a crawl
or recrawl with an active classifier. It will also include a review of
relevant prior work in the areas of supervised text classification and
keyword extraction, and a basic implementation of a representative text
classification algorithm.
- An implementation of a classifier for web pages. The classifier will be
implemented as a PHP program independent of the Yioop! framework, and will
be trained on a large hand-annotated corpus. It should classify the web
pages of a test corpus (independent of the training corpus) into the
learned classes with accuracy approaching the current state of the
art.
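As a concrete reference point for this deliverable, the multinomial Naive Bayes classifier mentioned in the schedule can be sketched as follows. This is a minimal Python illustration (the deliverable itself would be PHP); the class name, method names, and add-one smoothing are representative choices, not the project's final design.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over bag-of-words features,
    with add-one (Laplace) smoothing. Illustrative sketch only."""

    def train(self, docs):
        # docs: list of (token_list, label) pairs.
        self.class_counts = Counter(label for _, label in docs)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in docs:
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total = sum(self.class_counts.values())

    def classify(self, tokens):
        # Returns (best_label, confidence), where confidence is the
        # posterior probability of the best label, normalized over
        # all labels -- the score a thresholding step could use.
        log_probs = {}
        v = len(self.vocab)
        for label, count in self.class_counts.items():
            lp = math.log(count / self.total)  # log prior
            denom = sum(self.word_counts[label].values()) + v
            for tok in tokens:
                lp += math.log((self.word_counts[label][tok] + 1) / denom)
            log_probs[label] = lp
        best = max(log_probs, key=log_probs.get)
        norm = sum(math.exp(lp - log_probs[best]) for lp in log_probs.values())
        return best, 1.0 / norm
```

Returning a normalized posterior alongside the label matches the proposal's need for a per-page confidence score that can feed both the candidate-filtering step and the crawl-time meta tag.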
- An implementation of keyword extraction. The keyword extraction
algorithm will be implemented as a PHP program independent of the Yioop!
framework, and should extract the top N keywords from each
document in a hand-annotated test corpus with accuracy approaching the
current state of the art.
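One standard baseline for keyword extraction is TF-IDF scoring over the corpus, which could be sketched as below. This is a hedged Python illustration (the deliverable itself would be PHP); the function name and the particular TF and IDF formulas are assumptions, since the proposal leaves the extraction algorithm to be determined by the background reading.

```python
import math
from collections import Counter

def top_keywords(docs, n=5):
    # docs: list of token lists, one per document.
    # Returns, for each document, its n highest-scoring terms by
    # TF-IDF: term frequency within the document times the log of
    # inverse document frequency across the corpus.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency counts each doc once
    num_docs = len(docs)
    results = []
    for tokens in docs:
        tf = Counter(tokens)
        scores = {t: (count / len(tokens)) * math.log(num_docs / df[t])
                  for t, count in tf.items()}
        results.append(sorted(scores, key=scores.get, reverse=True)[:n])
    return results
```

Terms that appear in every document get an IDF of zero, so corpus-wide stopwords fall out of the ranking without an explicit stopword list.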
- A report on progress made over the semester and on what work
remains. The report will review the reading that I've done, discuss the
performance of the classifier and keyword extraction tools that I've
implemented, and lay out a plan for completing the project in the next
semester.
References:
[Breiman96] Bagging predictors. Breiman, L. Machine Learning,
24(2):123-140. 1996.
[Büttcher10] Information Retrieval: Implementing and
Evaluating Search Engines. Büttcher, Stefan, Clarke, Charles, and Cormack,
Gordon. The MIT Press. 2010.
[Crammer02] On the algorithmic implementation of multiclass
kernel-based vector machines. Crammer, K., and Singer, Y. Journal of Machine
Learning Research, 2:265-292. 2002.
[Langley92] An analysis of Bayesian classifiers. Langley, P.,
Iba, W., Thompson, K. In Proc. of the 10th National Conf. on Artificial
Intelligence, 1992.
[Lee08] A cluster-based resampling method for pseudo-relevance
feedback. Lee, K.S., Croft, W.B., and Allan, J. In Proc. of the 31st Ann.
Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval,
pages 235-242. 2008.
[Ramoni01] Robust Bayes classifiers. Ramoni, M., Sebastiani, P.
Artificial Intelligence, 125: 209-226. 2001.
[Ruthven03] A survey on the use of relevance feedback for
information access systems. Ruthven, I., and Lalmas, M. Knowledge Engineering
Review, 18(2):95-145. 2003.
[Schapire00] BoosTexter: A boosting-based system for text
categorization. Schapire, R., and Singer, Y. Machine Learning, 39(2):135-168.
2000.
[Sebastiani02] Machine learning in automated text
categorization. Sebastiani, F. ACM Computing Surveys, 34(1):1-47. 2002.
[Tan06] Introduction to Data Mining. Tan, Pang-Ning, Steinbach,
M., Kumar, V. Addison Wesley. 2006.
[Wang08] A study of methods for negative relevance feedback.
Wang, X., Fang, H., and Zhai, C. In Proc. of the 31st Ann. Intl. ACM SIGIR
Conf. on Research and Development in Information Retrieval, pages 219-226.
2008.