Language Modeling, KL Divergence




CS267

Chris Pollett

Nov. 21, 2012

Outline

Introduction

Language Models

Smoothing

HW Problem

Exercise 7.2. Suppose the search engine employs the INPLACE strategy with proportional pre-allocation (factor `k=2`). Prove that, for any given list, the total number of bytes transferred from/to disk is less than `5 times s`, where `s` is the list's size in bytes. You may assume that all postings have constant size.

Answer. Consider the step in which a list grows from size `b` to size `k cdot b`. For the list to have reached size `b`, we must already have written `b` bytes to disk, and to prepare the list of size `k cdot b` we must read these `b` bytes back in. If `M` is the original default posting list length, then the total number of bytes read and written to get a list of `s` bytes is:

`2M + 2 cdot 2 cdot M + 2 cdot 2^2 cdot M + cdots + 2 cdot 2^(log_2(s/M)) cdot M`
`= 2M(2^(log_2(s/M) + 1) - 1) = 2M(2 cdot (s/M) - 1) < 2M cdot 2 cdot (s/M) = 4s < 5s`

Here we use the formula for a geometric series, and we bound by 5 rather than 4 to absorb any slack caused by not rounding up the logarithm and the division.
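
As a sanity check on the bound, here is a minimal simulation sketch. It is not from the textbook or the lecture: the function name `bytes_transferred`, the 8-byte posting size, and the initial capacity of 4 postings are all illustrative assumptions. It also uses a slightly different accounting than the sum above: each appended posting costs one write, and each relocation reads and rewrites the bytes currently stored.

```python
def bytes_transferred(num_postings, posting_size=8, initial_capacity=4, k=2):
    """Simulate appending postings to one in-place list.

    Returns (s, transferred): the final list size in bytes and the total
    number of bytes read from or written to disk. All parameter defaults
    are illustrative assumptions, not values from the exercise.
    """
    capacity = initial_capacity  # allocated space, measured in postings
    stored = 0                   # postings currently in the list
    transferred = 0              # bytes read or written so far
    for _ in range(num_postings):
        if stored == capacity:
            # The list is full: read it back in and rewrite it into a
            # region k times larger (proportional pre-allocation).
            transferred += 2 * stored * posting_size
            capacity *= k
        transferred += posting_size  # write the new posting
        stored += 1
    return stored * posting_size, transferred


if __name__ == "__main__":
    for n in (1, 4, 5, 1000, 4097):
        s, t = bytes_transferred(n)
        print(n, s, t, t < 5 * s)
```

Under these assumptions the ratio `t / s` is largest just after a relocation, and it stays below 5 for every list length tried.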

Ranking with Language Models

Massaging our equations

Substituting in a particular model

Kullback-Leibler Divergence