Chris Pollett > Students >

    ( Print View )


    [Project Blog]

    [CS297 Proposal]




    [CS297 Report - PDF]

    [CS298 Proposal]

    [CS298 Progress Report Spring 2011- PDF]

    [CS298 Project Report - PDF]


Deliverable 3


The goal of this deliverable was to understand the working of BM25F algorithm and implement the algorithm in PHP from scratch.

BM25 Algorithm

In information retrieval, Okapi BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.

Given a query q, containing keywords q1,...,qn, the BM25 score of a document d is:

bm25 equation 1

where occursdt is the term frequency of t in d; ld is the document length; avld is the document average length along the collection; k1 is a free parameter usually 2 and b belongs to [0,1] (usually 0.75). Assigning 0 to b is equivalent to avoid the process of normalisation and therefore the document length will not affect the final score. If b takes 1, we will be carrying out a full length normalisation.

bm25 equation 2

where N is the number of document in the collection and df is the number of documents where appears the term t.

BM25F Algorithm

BM25F is a newer variant of BM25 that can take document structure and anchor text into account. First we obtain the accumulated weight of a term over all fields as

bm25f equation 1

where lc is the field length; avlc is the average length for the field c ; bc is a constant related to the field length, similar to b in BM25 and boostc is the boost factor applied to field c. Next, a non-linear saturation is applied

bm25f equation 2

and equal equation for idf(t)

bm25f equation 3

where N is the number of document in the collection and df is the number of documents where appears the term t.

BM25F Simulator

To simulate the BM25F algorithm a tool was created in PHP which takes as input 100 websites and extract all the terms that occur in them. It then computes the BM25F score for all these terms based on where they occur in the document. This information is stored in a table which is used by the tool which accepts a user query. Based on the query the tool generates the combined BM25F score of query for all the 100 pages. Below are the snapshots of the tool for a query.

bm25f snapshot 1

bm25f snapshot 2

Source Code