CS298 Proposal

Improving the BM25F algorithm for use with OPIC-based crawlers

Ravi Inder Singh Dhillon (ravi.dhillon@yahoo.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun, Dr. Jon Pearce

Abstract:

This project aims to improve the BM25F algorithm for use with OPIC-based crawlers. The BM25F ranking function calculates page relevance by assigning weights to the document fields and the anchor field. The title and body of a document are termed document fields. The anchor field of a document refers to all the anchor text in the collection that points to that document. Thus, if many unimportant links point to an important web page, their anchor text can inflate that page's relevance for searches on words that are not actually relevant to it. Hence the goal is to implement a modified BM25F that incorporates the page rank computed by the OPIC algorithm when computing the weight of the anchor field associated with a web page.

The open source search engine YIOOP will be used as a case study for the project. The modified BM25F should provide a better ranking estimate for documents crawled by YIOOP. For the documents crawled by YIOOP, a posting list is the set of all documents in the index that contain a given word. This posting list is very large and needs to be trimmed to obtain the most relevant documents. We will run experiments on how far to scan into the posting list and decide on an optimum cutoff point. At the end of the project we expect to have concrete results for improving the page relevance calculation for OPIC-based crawlers.

CS297 Results
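As an illustration of the idea stated in the abstract, the following is a minimal Python sketch of a BM25F-style score in which each anchor-text occurrence is scaled by the OPIC score of the page that supplies the link. It is not YIOOP's implementation, and the proposal does not fix the exact way the two are combined. The `Doc` class, the field weights, the `B` and `K1` parameters, the smoothed IDF form, and the assumption that OPIC scores are normalized to [0, 1] are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical document record used only for this sketch."""
    title: list   # list of title terms
    body: list    # list of body terms
    # Inbound links as (OPIC score of the linking page, anchor-text terms).
    inlinks: list = field(default_factory=list)


# Hypothetical per-field weights and length-normalization parameters.
WEIGHTS = {"title": 3.0, "body": 1.0, "anchor": 2.0}
B = {"title": 0.5, "body": 0.75, "anchor": 0.5}
K1 = 1.2


def field_lengths(doc):
    """Number of terms in each field of the document."""
    return {
        "title": len(doc.title),
        "body": len(doc.body),
        "anchor": sum(len(terms) for _, terms in doc.inlinks),
    }


def field_tfs(doc, term):
    """Term frequency per field; anchor hits are scaled by the source's OPIC score."""
    return {
        "title": doc.title.count(term),
        "body": doc.body.count(term),
        # Assumption: an anchor occurrence from a low-OPIC page counts for less.
        "anchor": sum(opic * terms.count(term) for opic, terms in doc.inlinks),
    }


def bm25f_score(query, doc, docs):
    """BM25F-style score of `doc` for `query` over the collection `docs`."""
    n_docs = len(docs)
    avg = {f: sum(field_lengths(d)[f] for d in docs) / n_docs for f in WEIGHTS}
    lengths = field_lengths(doc)
    score = 0.0
    for term in query:
        tfs = field_tfs(doc, term)
        weighted_tf = 0.0
        for f in WEIGHTS:
            if avg[f] == 0:
                continue  # field is empty across the whole collection
            norm = 1 - B[f] + B[f] * lengths[f] / avg[f]
            weighted_tf += WEIGHTS[f] * tfs[f] / norm
        # Document frequency: documents with a nonzero weighted count in any field.
        df = sum(1 for d in docs if any(field_tfs(d, term).values()))
        # Smoothed IDF variant, chosen here so the value is never negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * weighted_tf / (K1 + weighted_tf)
    return score


# Two pages with identical text: one linked once by a high-OPIC page,
# the other linked five times by low-OPIC pages using the same anchor word.
well_linked = Doc(["snakes"], ["reptile", "care"], [(0.9, ["python"])])
spam_linked = Doc(["snakes"], ["reptile", "care"], [(0.01, ["python"])] * 5)
collection = [well_linked, spam_linked]
```

With plain anchor counts the second page would have five times as many anchor hits for "python" as the first. Under the OPIC-scaled counts in this sketch, the page linked by the single high-OPIC source scores higher, which is the effect the proposal is after.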
Proposed Schedule
Key Deliverables:
Innovations and Challenges
References:

[APC2003] Serge Abiteboul, Mihai Preda, and Gregory Cobena (2003). Adaptive on-line page importance computation. In: Proceedings of the 12th International Conference on World Wide Web. pp. 280-290.

[ZCTSR2004] Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson (2004). Microsoft Cambridge at TREC-13: Web and HARD tracks. In: Proceedings of the 13th Annual Text Retrieval Conference.

[BSV2004] Paolo Boldi, Massimo Santini, and Sebastiano Vigna (2004). Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations. Algorithms and Models for the Web-Graph. pp. 168-180.

[AC2006] Amy N. Langville and Carl D. Meyer (2006). Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press.

[Wiki2011] Okapi BM25F. Wikipedia, the free encyclopedia.