# Deliverable 3

## Goal

The goal of this deliverable was to understand the working of BM25F algorithm and implement the algorithm in PHP from scratch.

## BM25 Algorithm

In information retrieval, Okapi BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.

Given a query *q*, containing keywords *q*_{1},...,q_{n}, the BM25 score of a document d is:

where *occurs*^{d}_{t} is the term frequency of *t* in *d*; *l*_{d} is the document length; *avl*_{d} is the document average length along the collection; *k*_{1} is a free parameter usually 2 and *b* belongs to [0,1] (usually 0.75). Assigning 0 to *b* is equivalent to avoid the process of normalisation and therefore the document length will not affect the final score. If *b* takes 1, we will be carrying out a full length normalisation.

where *N* is the number of document in the collection and *df* is the number of documents where appears the term *t*.

## BM25F Algorithm

BM25F is a newer variant of BM25 that can take document structure and anchor text into account. First we obtain the accumulated weight of a term over all fields as

where *l*_{c} is the field length; *avl*_{c} is the average length for the field *c* ; *bc* is a constant related to the field length, similar to *b* in BM25 and *boost*_{c} is the boost factor applied to field *c*. Next, a non-linear saturation is applied

and equal equation for *idf(t)*

where *N* is the number of document in the collection and *df* is the number of documents where appears the term *t*.

## BM25F Simulator

To simulate the BM25F algorithm a tool was created in PHP which takes as input 100 websites and extract all the terms that occur in them. It then computes the BM25F score for all these terms based on where they occur in the document. This information is stored in a table which is used by the tool which accepts a user query. Based on the query the tool generates the combined BM25F score of query for all the 100 pages. Below are the snapshots of the tool for a query.

## Source Code