CS297 Proposal
Improving the BM25F algorithm for use with OPIC based crawlers.
Ravi Inder Singh Dhillon (ravi.dhillon@yahoo.com)
Advisor: Dr. Chris Pollett
Description:
This CS297 proposal aims at improving the BM25F algorithm for use with OPIC based crawlers. OPIC is an acronym for Online Page Importance Computation algorithm. In contrast to algorithms which compute the page rank offline using a strong link matrix, OPIC does an online computation of Page Importance or Page rank while crawling the web. BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Sparck Jones, and others. BM25F is a modification of BM25 in which the document is considered to be composed from several fields (such as headlines, main text, anchor text) with possibly different degrees of importance. Thus the page relevance is based on weights assigned to these fields. The open source search engine YIOOP being used as a case study appends all the link texts pointing to a web page at the end of it. If a lot of unimportant links are pointing to a webpage they can increase the page relevance of important web pages for unimportant word searches that are not relevant to it. Therefore we aim to improve BM25F by considering page rank computed by OPIC algorithm while computing weights for link text appended to web pages.
Schedule:
Week 1-2: 08/25/2010 - 09/05/2010 |
Discuss the project in detail with the advisor. |
Week 3-4: 09/06/2010 - 09/19/2010 |
Download and study the code for YIOOP. |
Week 5: 09/20/2010 - 09/26/2010 |
Deliverable 1: Perform a crawl using YIOOP and make changes to its UI. |
Week 6-7: 09/27/2010 - 10/10/2010 |
Read the [APC2003] paper on OPIC in reference section and deliver a presentation. |
Week 8: 10/11/2010 - 10/17/2010 |
Deliverable 2: Implement the OPIC algorithm using PHP |
Week 9-10: 10/18/2010 - 10/31/2010 |
Read the [ZCTSR2004] paper on BM25F in reference section and deliver a presentation. |
Week 11: 11/01/2010 - 11/07/2010 |
Deliverable 3: Implement the BM25F algorithm using PHP |
Week 12-13: 11/08/2010 - 11/21/2010 |
Study about weightings and their role in BM25F Page ranking |
Week 14: 11/22/2010 - 11/28/2010 |
Deliverable 4: Modify weighting strategy in YIOOP code. |
Week 15-17: 11/29/2010 - 12/12/2010 |
Deliverable 5: Final 297 Report |
Deliverables:
The full project will be done when CS298 is completed. The following will
be done by the end of CS297:
1. Perform a crawl using YIOOP and make changes to its UI.
2. Implement OPIC algorithm using PHP.
3. Implement BM25F algorithm using PHP.
4. Modify weighting strategy in YIOOP code.
5. Final 297 Report.
References:
[APC2003] Serge Abiteboul and Mihai Preda and Gregory Cobena. Adaptive on-line page importance computation. In: Proceedings of the 12th international conference on World Wide Web. pp.280-290. 2003.
[ZCTSR2004] Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of 3th Annual Text Retrieval Conference. 2004.
[BSV2004] Paolo Boldi and Massimo Santini and Sebastiano Vigna. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations. Algorithms and Models for the Web-Graph. pp. 168180. 2004.
Google's PageRank and Beyond: The Science of Search Engine Rankings----by Amy N. Langville and Carl D. Meyer
Wikipedia |