Outline
- Document Quality Measures
- Traffic Rank
- OPIC
- In-Class Exercise
- Page Rank
- HITS
- SALSA
Introduction
- As we mentioned when we first started talking about term-at-a-time query processing, most scoring functions
for queries have the format:
`score(q,d) = quality(d) + sum_(t in q) score(t,d)`
(a small code sketch of this additive form appears at the end of this introduction).
- Here `quality(d)` represents the quality of the document independent of the query.
- Today, we are going to look at different ways that one might compute `quality(d)`.
- In particular, we will be interested in `quality(d)` for documents on the web.
- At a high level, a document might be of high quality if a user gains a lot of information by reading it. This might be the
case if the document is easy to understand, contains good content, ...
- These are all hard things to measure, so most quality measures try to use a stand-in for this.
- We begin today looking at some "easier" to compute stand-ins for quality, and then move to some which are slightly more complex.
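- As a toy illustration (a minimal sketch; the per-term score and quality values here are made-up stand-ins, not a formula from the lecture), the additive form might be realized as:

```python
# A toy instance of the additive scoring formula
#   score(q,d) = quality(d) + sum_(t in q) score(t,d)
# The per-term score used here (raw term frequency) is only a placeholder.

def term_score(term, doc_terms):
    # Toy per-term score: the number of times the term occurs in the document.
    return doc_terms.count(term)

def score(query_terms, doc_terms, quality):
    # quality is the query-independent quality(d) value discussed above.
    return quality + sum(term_score(t, doc_terms) for t in query_terms)

if __name__ == "__main__":
    doc = "the quick brown fox jumps over the lazy dog".split()
    print(score(["quick", "fox"], doc, quality=0.5))  # 0.5 + 1 + 1 = 2.5
```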
Traffic Rank
- One way to estimate the quality of a page is by seeing which pages people use to navigate around the web.
- Let `p_(ij)` be the proportion of all web traffic on the link from page `i` to page `j`.
- `p_(ij) = 0` if there is no link from `i` to `j`.
- We define the Traffic Rank of a page `j` to be the sum of all the traffic leading to `j`. i.e., `sum_i p_(ij)`.
- Obviously, `p_(ij) geq 0` (*) and `sum_(i,j) p_(ij) = 1`, so the traffic rank is a probability measure. (**)
- Also the flow into a page equals the flow out, so `sum_i p_(ij) - sum_i p_(ji) = 0` for every page `j`. (***)
- We finally have the constraint that `p_(ij) = 0` if there is no link/traffic between `i` and `j`. (****).
- Besides these four conditions, we could imagine that `p_(ij)` is otherwise unconstrained.
- Tomlin (2003) shows how to formulate the above as a constrained optimization problem where we are trying to maximize
the entropy function `-sum_(i,j) p_(ij) log p_(ij)` subject to the (linear) constraints (*), (**), (***), and (****).
- The entropy function is used because one can show the maximum entropy solution makes the fewest additional assumptions about the `p_(ij)`'s beyond these constraints.
- Using Lagrange multiplier techniques it is actually possible to calculate the `p_(ij)` given knowledge of the web graph
in time not much slower than what we'll get for page rank.
- To keep things simpler, Amazon's Alexa Internet tries to measure the `p_(ij)` directly, using its browser toolbar to track
which links users click on and from which pages.
- Total toolbar traffic is then used as a stand-in for all web traffic, so the `p_(ij)`'s can be directly calculated from this.
In turn, the traffic rank can be calculated from these `p_(ij)`'s, as in the sketch below.
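- Here is a small sketch of that last step, assuming we already have estimates of the `p_(ij)`'s (the numbers in the example are invented, not real toolbar data):

```python
# Sketch: computing traffic rank from a matrix p where p[i][j] is the fraction of all
# observed web traffic on the link from page i to page j. The numbers are invented;
# in practice they would be estimated from toolbar data.

def traffic_rank(p):
    n = len(p)
    # Traffic rank of page j is sum_i p[i][j], the total traffic flowing into j.
    return [sum(p[i][j] for i in range(n)) for j in range(n)]

if __name__ == "__main__":
    # Three pages; all the entries together sum to 1.
    p = [[0.0, 0.2, 0.1],
         [0.3, 0.0, 0.1],
         [0.2, 0.1, 0.0]]
    print(traffic_rank(p))  # [0.5, 0.3, 0.2]
```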
Online Page Importance Computation (OPIC)
- Abiteboul, Preda, Cobena (2003) propose a way to estimate document importance, which could be used as a surrogate for quality, while performing a web crawl.
- When crawling the web, a web crawler needs to maintain a queue of links that it needs to crawl.
- One easy way to crawl is just to do a breadth-first exploration of the web graph.
- Instead, Abiteboul, Preda, and Cobena propose using a best-first exploration -- always choose the best not-yet-crawled page to crawl next.
- Their online page importance algorithm (OPIC) makes use of a priority queue to keep pages arranged in order of importance.
- Importance is measured by how much money a page has received.
- Here we imagine that each seed site is initially given some amount of money, say 10 dollars. When a fetcher downloads a page and extracts its links, it distributes the page's money equally amongst all of the links on that page. So if there were 10 links on the page, the fetcher would give each of them a dollar and then send the summary, now including dollar figures, to the queue server.
- It is completely possible for more than one page to point to a given other page. For instance, many pages have links to yahoo.com.
- If the queue server receives from a fetcher a link that is already in its queue, it adds the money that the fetcher gave to the url to the value the queue server already has for that url.
- When the queue server wants to decide what to crawl next it picks the page that has the most money.
- The queue server actually maintains two numbers for each page: the total amount of money a page has ever received, and the amount of money since the page was last crawled.
- When a page is popped off the top of the priority queue, the second quantity is reset to zero. This algorithm does allow a page to be recrawled.
- The total money a page has ever received in a crawl is used as a measure of its importance (see the sketch below).
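- Below is a minimal sketch of the OPIC bookkeeping under simplifying assumptions: the whole web graph is given up front as a dictionary, a linear scan stands in for the priority queue, and the fetchers are omitted. It shows how cash flows through a crawl; it is not the authors' implementation.

```python
# Sketch of OPIC-style bookkeeping. graph maps each URL to the list of URLs it links to.
# cash[u] is the money u has accumulated since it was last crawled; history[u] is the
# total money u has ever received, which serves as the importance estimate.

def opic_crawl(graph, seeds, initial_cash=10.0, steps=10):
    cash = {u: 0.0 for u in graph}
    history = {u: 0.0 for u in graph}
    for s in seeds:
        cash[s] = initial_cash
        history[s] = initial_cash
    order = []
    for _ in range(steps):
        # Pick the page with the most uncrawled cash (a linear scan standing in
        # for the queue server's priority queue).
        page = max(cash, key=cash.get)
        if cash[page] == 0.0:
            break
        order.append(page)
        money, cash[page] = cash[page], 0.0          # reset the "since last crawl" amount
        links = graph.get(page, [])
        if links:
            share = money / len(links)               # split the money equally among out-links
            for v in links:
                cash[v] = cash.get(v, 0.0) + share
                history[v] = history.get(v, 0.0) + share
    return order, history

if __name__ == "__main__":
    graph = {"S1": ["S2"], "S2": ["S3"], "S3": ["S1"]}
    order, importance = opic_crawl(graph, seeds=["S1"], steps=5)
    print(order, importance)
```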
In-Class Exercise
- Suppose we start off a crawl from sites `S_1, S_2`. There are also sites on the web, `S_3`, `S_4`, `S_5`.
- The web graph is shaped like a square with edges `S_1 -> S_2 -> S_3 -> S_4 -> S_1`. In addition, each of these sites links to `S_5`
and `S_5` links to `S_1`.
- Determine how OPIC would crawl these sites, showing your work, until all sites are crawled.
- Post your solution to the May 5 In-Class Exercise Thread.
Page Rank
- Google's famous PageRank (Brin and Page 1998) algorithm assigns a rank, `r(P)`, to a page `P` roughly as the sum `sum_i \frac{r(P_i)}{|P_i|}` over each page `P_i` which links to it. Here `|P_i|` is the number of links going out of page `P_i`.
- Here the notion of rank is what we have been calling page quality.
- Notice to calculate `r(P)`, we need to know `r(P_i)`.
- To solve this, let `\vec{r}` be the vector of ranks of all web pages. i.e., the `j`th component of `\vec r` is `r(P_j)`. We assume the rank represents the probability a random person on
the web happens to be on a particular page, so for now the ranks are between 0 and 1. Let `A_{ij}` be `1/|P_j|` if there is an edge from page `j` to page `i`; `0`, otherwise (so each column of `\vec{A}` corresponding to a page with out-links sums to 1). Then roughly we want a `\vec{r}` such that `\vec{r} = \vec{A}\vec{r}`.
The Power Method
- Start with a guess for `\vec{r}`, say `\vec{r'}`, and compute powers `\vec{A}^{(n)}\vec{r'}` until
`||\vec{A}^{(n+1)}\vec{r'} - \vec{A}^{(n)}\vec{r'}|| < \epsilon`
for whatever choice of `\epsilon` you think is sufficiently small. Then, if we set `\vec{r} = \vec{A}^{(n)}\vec{r'}`, we will have
`\vec{A}\vec{r} = \vec{A}^{(n+1)}\vec{r'} \approx \vec{A}^{(n)}\vec{r'} = \vec{r}`
as desired.
- This is called the power method for computing eigenvectors.
- The Perron-Frobenius Theorem shows you actually get convergence as `n` gets large provided `\vec{A}` meets certain conditions (which the basic link matrix probably does not). A sketch of the iteration on a small example appears below.
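- A minimal sketch of this iteration on a small made-up matrix (whether it converges depends on the conditions just mentioned):

```python
# Sketch of the power method: repeatedly apply the matrix A (a list of rows) to a
# starting vector until successive iterates differ by less than epsilon.

def power_method(A, r0, epsilon=1e-8, max_iters=1000):
    r = r0
    for _ in range(max_iters):
        r_next = [sum(A[i][j] * r[j] for j in range(len(r))) for i in range(len(A))]
        if max(abs(a - b) for a, b in zip(r_next, r)) < epsilon:
            return r_next
        r = r_next
    return r

if __name__ == "__main__":
    # Made-up 3-page graph: page 1 links to 2; page 2 links to 1 and 3; page 3 links to 1.
    # A[i][j] = 1/|P_j| if page j links to page i, so each column sums to 1.
    A = [[0.0, 0.5, 1.0],
         [1.0, 0.0, 0.0],
         [0.0, 0.5, 0.0]]
    print(power_method(A, [1/3, 1/3, 1/3]))  # converges to roughly [0.4, 0.4, 0.2]
```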
Tweaks on the basic matrix
- The original Brin and Page paper adds tweaks to the basic link matrix, known collectively as the random surfer model, to guarantee convergence.
- For instance, whenever one has a node without outlinks, one pretends the node is connected to every other node on the web. This modification ensures the matrix `\vec{A}'` is stochastic. Let `\vec{e}` denote the vector of all 1's, let `n` be the number of pages, and let `a_i = 1` if page `i` is a dangling node and 0 otherwise. Formally, we let `\vec{A}' = \vec{A} + \frac{1}{n}\vec{e}\,\vec{a}^T`. The term `\frac{1}{n}\vec{e}\,\vec{a}^T` we call the dangling node matrix.
- This still leaves the possibility that the web is split into several strongly connected components, which could mean that powering `\vec{A}'` would cycle through a set of vectors rather than converge. To fix this, one takes a linear combination `\alpha \vec{A}' + (1-\alpha)\vec{H}` where `\vec{H}` is the matrix in which every entry is `\frac{1}{n}`. This modification can be viewed as saying that from any page on the web the random surfer might decide randomly to switch to any other page on the web. We call `\vec{H}` the teleporter matrix.
- Google supposedly uses `50` iterations to approximate `\vec{r}`.
- The `r(P_i)` you get from the above calculations are numbers between `0` and `1`. Taking negative logs and scaling gives the more familiar values between `0` and `10`. A sketch of page rank with both tweaks appears below.
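- Below is a sketch of page rank with both tweaks; the value `alpha = 0.85` is a commonly quoted damping choice rather than one given in these notes, and the tiny graph is made up.

```python
# Sketch of page rank with the two tweaks folded into the iteration rather than forming
# the dense matrices: a dangling page's rank is spread over every page, and with
# probability (1 - alpha) the surfer teleports to a uniformly random page.
# graph maps each page index to a list of page indices it links to.

def pagerank(graph, n, alpha=0.85, iters=50):
    r = [1.0 / n] * n                               # start from the uniform distribution
    for _ in range(iters):
        dangling = sum(r[i] for i in range(n) if not graph.get(i))
        # Contribution of the teleporter matrix and the dangling node matrix is uniform.
        new_r = [(1.0 - alpha) / n + alpha * dangling / n] * n
        for i in range(n):
            out = graph.get(i, [])
            if out:
                share = alpha * r[i] / len(out)     # alpha * r(P_i)/|P_i| flows along each link
                for j in out:
                    new_r[j] += share
        r = new_r
    return r

if __name__ == "__main__":
    # Tiny made-up graph: 0 -> 1, 1 -> 0 and 2, and 2 has no out-links (a dangling node).
    print(pagerank({0: [1], 1: [0, 2], 2: []}, 3))
```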
Topic Specific Page Rank
- Page rank has been extensively studied.
- People have proven mathematical theorems about its convergence rate based on the gap between the first and second eigenvalues and the choice of `\alpha` for the teleporter matrix, and have studied memory issues in its computation, etc.
- We mention here briefly one interesting extension to page rank, which is to compute different page ranks based on topics and then use a linear combination of these.
- This was originally proposed by Haveliwala (2002).
- A source of topics was given by the top-level DMOZ (Open Directory Project) topics. DMOZ has since been resurrected as curlie.org.
- When a query comes in, you use a Bayes classifier to compute, for each topic, the probability that the topic is relevant to the query.
- Then the topic-specific page ranks for the page are weighted by these probabilities and summed to give a total page rank for the page, as in the sketch below.
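- A small sketch of this combination step (the classifier probabilities and topic-specific ranks below are invented numbers):

```python
# Sketch of combining topic-specific page ranks. topic_probs[t] is the classifier's
# probability that topic t is relevant to the query; topic_ranks[t][page] is the page
# rank of the page computed with respect to topic t. All values below are invented.

def combined_rank(page, topic_probs, topic_ranks):
    return sum(topic_probs[t] * topic_ranks[t][page] for t in topic_probs)

if __name__ == "__main__":
    topic_probs = {"sports": 0.7, "news": 0.3}
    topic_ranks = {"sports": {"espn.com": 0.02, "cnn.com": 0.005},
                   "news":   {"espn.com": 0.004, "cnn.com": 0.03}}
    print(combined_rank("espn.com", topic_probs, topic_ranks))  # 0.7*0.02 + 0.3*0.004
```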
More Algorithms -- HITS
- Ask.com (Teoma) used to use another kind of algorithm called HITS (Kleinberg 1998).
- This works by iteratively computing two scores for a page: (a) an authority score, the sum of the hub scores of the pages linking to it, and (b) a hub score, the sum of the authority scores of the pages it links to.
- In its original formulation, it was query dependent in its calculation of (a) and (b) -- only links among pages related to the query keywords were considered.
- To be more precise, in HITS we initialize two scores for each page, `x_i^((0))`, the authority score, and `y_i^((0))`, the hub score, each to one over the number of pages.
- We then compute `x_i^((k)) = sum_(j:e_(ji) in E)y_j^((k-1))` and `y_i^((k)) = sum_(j:e_(ij) in E)x_j^((k))`
- Notice `x_i^((k))` is the sum over the links into `i` of the hub scores of the previous round and `y_i^((k))` is the sum over the links out of `i` of the authority scores.
- We iterate through `k` until the differences between `x_i^((k))` and `x_i^((k+1))`, as well as between `y_i^((k))` and `y_i^((k+1))`, are less than some `epsilon` (see the sketch at the end of this section).
- If `L` is the adjacency matrix then one can see `x^((k)) = L^TL x^((k-1))` and `y^((k)) = LL^T y^((k-1))`.
- Convergence of the iterations comes from the fact that `L^TL` and `LL^T` are symmetric positive semidefinite matrices.
- Convergence of HITS is typically faster than page rank requiring on the order of 10-15 iterations; however, it might suffer from uniqueness issues.
- Although in its original formulation it was query dependent, in practice, it was run in a query independent fashion.
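- Here is a minimal sketch of the HITS iteration run query independently on a made-up link graph; the per-round normalization is a standard detail not written out in the update equations above.

```python
# Sketch of the HITS iteration. graph maps each page index to the list of page indices
# it links to. x are authority scores, y are hub scores. The per-round normalization is
# a standard extra detail so the scores stay bounded.

def hits(graph, n, epsilon=1e-8, max_iters=100):
    x = [1.0 / n] * n    # authority scores
    y = [1.0 / n] * n    # hub scores
    for _ in range(max_iters):
        # Authority update: sum of hub scores of the pages linking in.
        new_x = [0.0] * n
        for i in range(n):
            for j in graph.get(i, []):
                new_x[j] += y[i]
        # Hub update: sum of the (new) authority scores of the pages linked to.
        new_y = [sum(new_x[j] for j in graph.get(i, [])) for i in range(n)]
        # Normalize so each score vector sums to 1.
        sx, sy = sum(new_x) or 1.0, sum(new_y) or 1.0
        new_x = [v / sx for v in new_x]
        new_y = [v / sy for v in new_y]
        if max(abs(a - b) for a, b in zip(new_x, x)) < epsilon and \
           max(abs(a - b) for a, b in zip(new_y, y)) < epsilon:
            return new_x, new_y
        x, y = new_x, new_y
    return x, y

if __name__ == "__main__":
    # Made-up graph: 0 -> 1 and 2, 1 -> 2, 3 -> 2.
    auth, hub = hits({0: [1, 2], 1: [2], 3: [2]}, 4)
    print(auth, hub)
```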
SALSA
- SALSA (Stochastic Approach to Link Structure Analysis) was proposed by Lempel and Moran (2000).
- If we look at the matrix `L` from the HITS slides, you notice that we never normalized the columns
or rows to make them sum to 1.
- On the other hand, when computing Page Rank we explicitly made sure the link matrix was stochastic (in our formulation, its columns summed to 1).
- The starting point of SALSA is to make two matrices `L_r` and `L_c` from `L`. In the first, we normalize the rows of `L` so they sum to 1. i.e., if a row has five `1` entries, we make each entry `1/5`. In the second, we normalize the columns to sum to `1` (see the sketch at the end of this section).
- We define a hub matrix `H` to be `L_rL_c^T` and we define an authority matrix `A` to be `L_c^TL_r`.
- We then use these two matrices when iterating and computing the authority and hub vectors.
- SALSA as an algorithm seems to be more immune to topic drift -- where a high-ranking but off-topic page creeps up in the results -- than HITS.
- Like HITS, the vectors produced by the algorithm might not be unique (i.e., if we start with different length-one initialization vectors, we might get different answers).
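- A small sketch of building `L_r`, `L_c`, and the SALSA hub and authority matrices from a toy adjacency matrix:

```python
# Sketch of building the SALSA matrices from a toy adjacency matrix L (dense lists,
# illustration only). L_r normalizes each nonzero row of L to sum to 1; L_c normalizes
# each nonzero column. Hub matrix = L_r L_c^T, authority matrix = L_c^T L_r.

def normalize_rows(L):
    out = []
    for row in L:
        s = sum(row)
        out.append([v / s for v in row] if s else list(row))
    return out

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def salsa_matrices(L):
    L_r = normalize_rows(L)                           # rows of L normalized to sum to 1
    L_c = transpose(normalize_rows(transpose(L)))     # columns of L normalized to sum to 1
    hub = matmul(L_r, transpose(L_c))                 # hub matrix = L_r L_c^T
    auth = matmul(transpose(L_c), L_r)                # authority matrix = L_c^T L_r
    return hub, auth

if __name__ == "__main__":
    L = [[0, 1, 1],
         [1, 0, 0],
         [0, 1, 0]]
    hub, auth = salsa_matrices(L)
    print(hub)
    print(auth)
```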