Chris Pollett > Students >
Ravi

    ( Print View )

    [Bio]

    [Project Blog]

    [CS297 Proposal]

    [Deliverable1]

    [Deliverable2]

    [Deliverable3]

    [CS297 Report - PDF]

    [CS298 Proposal]

    [CS298 Progress Report Spring 2011- PDF]

    [CS298 Project Report - PDF]

                          

























CS297 Proposal

Improving the BM25F algorithm for use with OPIC based crawlers.

Ravi Inder Singh Dhillon (ravi.dhillon@yahoo.com)

Advisor: Dr. Chris Pollett

Description:

This CS297 proposal aims at improving the BM25F algorithm for use with OPIC based crawlers. OPIC is an acronym for Online Page Importance Computation algorithm. In contrast to algorithms which compute the page rank offline using a strong link matrix, OPIC does an online computation of Page Importance or Page rank while crawling the web. BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Sparck Jones, and others. BM25F is a modification of BM25 in which the document is considered to be composed from several fields (such as headlines, main text, anchor text) with possibly different degrees of importance. Thus the page relevance is based on weights assigned to these fields. The open source search engine YIOOP being used as a case study appends all the link texts pointing to a web page at the end of it. If a lot of unimportant links are pointing to a webpage they can increase the page relevance of important web pages for unimportant word searches that are not relevant to it. Therefore we aim to improve BM25F by considering page rank computed by OPIC algorithm while computing weights for link text appended to web pages.

Schedule:

Week 1-2: 08/25/2010 - 09/05/2010

Discuss the project in detail with the advisor.

Week 3-4: 09/06/2010 - 09/19/2010

Download and study the code for YIOOP.

Week 5: 09/20/2010 - 09/26/2010

Deliverable 1: Perform a crawl using YIOOP and make changes to its UI.

Week 6-7: 09/27/2010 - 10/10/2010

Read the [APC2003] paper on OPIC in reference section and deliver a presentation.

Week 8: 10/11/2010 - 10/17/2010

Deliverable 2: Implement the OPIC algorithm using PHP

Week 9-10: 10/18/2010 - 10/31/2010

Read the [ZCTSR2004] paper on BM25F in reference section and deliver a presentation.

Week 11: 11/01/2010 - 11/07/2010

Deliverable 3: Implement the BM25F algorithm using PHP

Week 12-13: 11/08/2010 - 11/21/2010

Study about weightings and their role in BM25F Page ranking

Week 14: 11/22/2010 - 11/28/2010

Deliverable 4: Modify weighting strategy in YIOOP code.

Week 15-17: 11/29/2010 - 12/12/2010

Deliverable 5: Final 297 Report

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Perform a crawl using YIOOP and make changes to its UI.

2. Implement OPIC algorithm using PHP.

3. Implement BM25F algorithm using PHP.

4. Modify weighting strategy in YIOOP code.

5. Final 297 Report.

References:

[APC2003] Serge Abiteboul and Mihai Preda and Gregory Cobena. Adaptive on-line page importance computation. In: Proceedings of the 12th international conference on World Wide Web. pp.280-290. 2003.

[ZCTSR2004] Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of 3th Annual Text Retrieval Conference. 2004.

[BSV2004] Paolo Boldi and Massimo Santini and Sebastiano Vigna. Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations. Algorithms and Models for the Web-Graph. pp. 168180. 2004.

Google's PageRank and Beyond: The Science of Search Engine Rankings----by Amy N. Langville and Carl D. Meyer

Wikipedia