CS298 Proposal
Improving the BM25F algorithm for use with OPIC based crawlers.
Ravi Inder Singh Dhillon (ravi.dhillon@yahoo.com)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Robert Chun, Dr Jon Pearce
Abstract:
This project aims at improving the BM25F algorithm for use with OPIC-based crawlers. The BM25F ranking function calculates the page relevance by assigning
weights to document fields and the anchor field. The title and body of a document are termed as document fields. The anchor field of a document refers to all
the anchor text in the collection pointing to a particular document. Thus if a lot of unimportant links are pointing to a document they can increase the page
relevance of an important web page for unimportant word searches that are not relevant to it. Hence the goal is to implement a modified BM25F by combining
page rank computed by OPIC algorithm while computing weight for anchor field associated with a web page. The open source search engine YIOOP will be used as
a case study for the project.The modified BM25F will provide a better ranking estimate for documents crawled by YIOOP. For the documents crawled in YIOOP, a
posting list is set of all documents that contain a word in the index. This posting list is very large and needs to be trimmed to get the most relevent
documents. We would run experiments on how far we should go in the posting list and decide on an optimum cutoff point for scanning posting list. At the end
of the project we would come up with concrete results for improving page relevence calculation for OPIC based crawlers.
CS297 Results
- Studied the code and working of YIOOP Open Source search engine
- Studied and implemented OPIC algorithm in PHP
- Studied and implemented BM25F algorithm in PHP
Proposed Schedule
Week 1:
Feb.01-Feb.07 | Proposal |
Week 2:
Feb.08-Feb.15 | Study group iterator in YIOOP |
Week 3,4:
Feb.15-Feb.28 | Code variations in group iterator |
Week 5,6:
March.01-March14 | Run experiments for scanning posting list |
Week 7,8:
March.15-March.28 | Decide an optimum cutoff point for scanning posting list |
Week 9:
March.29-April.03 | Spring Recess |
Week 10,11:
April.04-April.17 | Combine relevence scores with OPIC |
Week 12:
April.18-April.24 | Optimize and finalize modifications in YIOOP |
Week 13:
April.25-May.01 | Start Writing Report |
Week 14:
May.02-May.08 | Finalize Report |
Week 15:
May.09-May.15 | Defense... |
Key Deliverables:
- Software
- Deliverable 1: Perform a statistical study cutoff point in scanning search result posting list versus TREC result scores.
- Deliverable 2: Determine an optimal cut-off point and give some theoretical justification as to why it might be reasonable.
- Deliverable 3: Perform a statistical study of BM25F weighting components and OPIC weight versus TREC scores.
- Deliverable 4: Suggest an optimal weigting scheme for BM25F based on results of Deliverable 3 and justify the results.
- Deliverable 5: Write a new version of group iterator in YIOOP incorporating results from deliverable 1-4 and which is optimized and fast for large
datasets.
- Report
Innovations and Challenges
- Deliverable 1 is challenging because it requires finding an optimal cutoff point for scanning the posting list.
- Deliverable 4 is innovative as it requires finding an efficient way to generate relevance scores for OPIC based crawlers.
- Deliverable 5 is innovative and challenging as it invloves optimizing YIOOP to run fast and get results in resonable time for large datasets.
References:
[APC2003]Serge Abiteboul and Mihai Preda and Gregory Cobena.(2003). Adaptive on-line page
importance computation. In: Proceedings of the 12th international conference on World
Wide Web. pp.280-290.
[ZCTSR2004] Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson.
Microsoft Cambridge at TREC-13 (2004): Web and HARD tracks. In Proceedings of 3th
Annual Text Retrieval Conference.
[BSV2004] Paolo Boldi and Massimo Santini and Sebastiano Vigna (2004). Do Your Worst to
Make the Best: Paradoxical Effects in PageRank Incremental Computations. Algorithms
and Models for the Web-Graph. pp. 168-180.
[AC2006] Amy N. Langville and Carl D. Meyer (2006). Google's PageRank and Beyond: The
Science of Search Engine Rankings. Princeton University Press.
[Wiki2011] Okapi BM25F. Wikipedia, the free encyclopedia. |