Chris Pollett > Students >
Ravi

    ( Print View )

    [Bio]

    [Project Blog]

    [CS297 Proposal]

    [Deliverable1]

    [Deliverable2]

    [Deliverable3]

    [CS297 Report - PDF]

    [CS298 Proposal]

    [CS298 Progress Report Spring 2011- PDF]

    [CS298 Project Report - PDF]

                          

























CS298 Proposal

Improving the BM25F algorithm for use with OPIC based crawlers.

Ravi Inder Singh Dhillon (ravi.dhillon@yahoo.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun, Dr Jon Pearce

Abstract:

This project aims at improving the BM25F algorithm for use with OPIC-based crawlers. The BM25F ranking function calculates the page relevance by assigning weights to document fields and the anchor field. The title and body of a document are termed as document fields. The anchor field of a document refers to all the anchor text in the collection pointing to a particular document. Thus if a lot of unimportant links are pointing to a document they can increase the page relevance of an important web page for unimportant word searches that are not relevant to it. Hence the goal is to implement a modified BM25F by combining page rank computed by OPIC algorithm while computing weight for anchor field associated with a web page. The open source search engine YIOOP will be used as a case study for the project.The modified BM25F will provide a better ranking estimate for documents crawled by YIOOP. For the documents crawled in YIOOP, a posting list is set of all documents that contain a word in the index. This posting list is very large and needs to be trimmed to get the most relevent documents. We would run experiments on how far we should go in the posting list and decide on an optimum cutoff point for scanning posting list. At the end of the project we would come up with concrete results for improving page relevence calculation for OPIC based crawlers.

CS297 Results

  • Studied the code and working of YIOOP Open Source search engine
  • Studied and implemented OPIC algorithm in PHP
  • Studied and implemented BM25F algorithm in PHP

Proposed Schedule

Week 1: Feb.01-Feb.07Proposal
Week 2: Feb.08-Feb.15Study group iterator in YIOOP
Week 3,4: Feb.15-Feb.28Code variations in group iterator
Week 5,6: March.01-March14Run experiments for scanning posting list
Week 7,8: March.15-March.28Decide an optimum cutoff point for scanning posting list
Week 9: March.29-April.03Spring Recess
Week 10,11: April.04-April.17Combine relevence scores with OPIC
Week 12: April.18-April.24Optimize and finalize modifications in YIOOP
Week 13: April.25-May.01Start Writing Report
Week 14: May.02-May.08Finalize Report
Week 15: May.09-May.15Defense...

Key Deliverables:

  • Software
    • Deliverable 1: Perform a statistical study cutoff point in scanning search result posting list versus TREC result scores.
    • Deliverable 2: Determine an optimal cut-off point and give some theoretical justification as to why it might be reasonable.
    • Deliverable 3: Perform a statistical study of BM25F weighting components and OPIC weight versus TREC scores.
    • Deliverable 4: Suggest an optimal weigting scheme for BM25F based on results of Deliverable 3 and justify the results.
    • Deliverable 5: Write a new version of group iterator in YIOOP incorporating results from deliverable 1-4 and which is optimized and fast for large datasets.
  • Report
    • CS298 Report

Innovations and Challenges

  • Deliverable 1 is challenging because it requires finding an optimal cutoff point for scanning the posting list.
  • Deliverable 4 is innovative as it requires finding an efficient way to generate relevance scores for OPIC based crawlers.
  • Deliverable 5 is innovative and challenging as it invloves optimizing YIOOP to run fast and get results in resonable time for large datasets.

References:

[APC2003]Serge Abiteboul and Mihai Preda and Gregory Cobena.(2003). Adaptive on-line page importance computation. In: Proceedings of the 12th international conference on World Wide Web. pp.280-290.

[ZCTSR2004] Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. Microsoft Cambridge at TREC-13 (2004): Web and HARD tracks. In Proceedings of 3th Annual Text Retrieval Conference.

[BSV2004] Paolo Boldi and Massimo Santini and Sebastiano Vigna (2004). Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations. Algorithms and Models for the Web-Graph. pp. 168-180.

[AC2006] Amy N. Langville and Carl D. Meyer (2006). Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press.

[Wiki2011] Okapi BM25F. Wikipedia, the free encyclopedia.