Outline
- General-purpose Data Compression
- Quiz
- Compressing Posting Lists
Introduction
- As we mentioned, an inverted index can be quite large, especially if we have a full positional index.
- The book gives an example showing that, for the three text collections it considers: the Shakespeare corpus (139%), TREC45 (122%), and GOV2 (77%), the uncompressed inverted index is typically larger than the original content. GOV2's index would not have been smaller than the original if JavaScript had not been included in the download size.
- Since reducing the size of the index both lets us index larger collections and speeds up query retrieval from such collections, we will now look at different techniques for compressing inverted indexes.
- We begin by considering how compression algorithms work in general.
General-Purpose Data Compression
- A data compression algorithm takes a chunk of data A and transforms it into another chunk of data B that is hopefully smaller than A.
- Such an algorithm consists of two components: the encoder, which takes A and produces B as above, and the decoder, which takes B and outputs some chunk of data C related to the original A.
- A compression method can be either lossless or lossy.
- If it is lossless, then C in the above should be an exact copy of A; if it is lossy, then C might only be a "good enough" approximation of A. (For example, a JPEG encoder performs lossy compression of an image; the decoder produces an image that is merely similar to the original.)
- For index compression, we are mainly interested in lossless compression.
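- As a minimal illustration of the lossless round-trip, here is a sketch using Python's standard zlib module (any lossless compressor would do):

```python
import zlib

A = b"to be or not to be, " * 50   # the original chunk of data
B = zlib.compress(A)               # encoder: B is (hopefully) smaller than A
C = zlib.decompress(B)             # decoder: C is recovered from B alone
assert C == A                      # lossless: C is an exact copy of A
assert len(B) < len(A)             # this repetitive input compresses well
```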
Symbolwise Data Compression
- Given a chunk of data A, we will usually be concerned with the information contained in A rather than
the precise bit string used for A. This is called the message, and we will denote it by M.
- This message is often represented as a sequence of symbols from some set `S` called an alphabet. I.e.,
`M = langle sigma_1, ..., sigma_n rangle, sigma_i in S`
- Data compression methods that treat the message as a sequence of symbols are called symbolwise or statistical techniques.
- Such techniques rely on two facts: (1) Not all symbols in M appear with the same frequency; by encoding more frequent symbols using fewer bits than less frequent ones, we can save space. (2) The `i`th symbol in a sequence `langle sigma_1, ..., sigma_n rangle` often depends on the previous symbols `langle sigma_1, ..., sigma_(i-1) rangle`, so we can save space when writing `sigma_i` by taking this into account. For example, we might code `qu` as one symbol, since `q` is almost always followed by a `u`.
Modeling and Coding
- Symbolwise compression methods often work in two phases: modeling and coding.
- In the modeling phase, a probability distribution `M` is computed that maps symbols to their probability of occurrence (a small sketch follows this list).
- In the coding phase the symbols in the message are re-encoded according to a code C.
- A code is a map from symbols to codewords: `sigma |-> C(sigma)`
- Methods can be further classified as static, semi-static, or adaptive, depending on the degree to which
the model `M` is independent of the particular message being compressed.
- Models can also be first-order, second-order, and so on, depending on whether (and on how many preceding symbols) the probability distribution `M` conditions.
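- As a minimal sketch of the modeling phase (semi-static and zeroth-order, i.e., probabilities estimated from symbol frequencies in the message itself):

```python
from collections import Counter

def zeroth_order_model(message: str) -> dict[str, float]:
    """Estimate each symbol's probability from its frequency in the message."""
    counts = Counter(message)
    return {sym: c / len(message) for sym, c in counts.items()}

# 'a' occurs 4 times out of 8, 'b' twice, 'c' and 'd' once each.
assert zeroth_order_model('aababacd') == {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
```

- This happens to yield exactly the distribution used as the example in the next section.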
Compression Models and Codes
- Compression models and codes are connected: Any model M has a code (or family of codes) associated with it that minimizes the average codeword length for a sequence of symbols generated according to M. Similarly, any code has an associated probability distribution for which the code is optimal.
- Consider the zeroth-order model M:
M(a) = 0.5, M(b) = 0.25, M(c) = 0.125, and M(d) = 0.125.
- A code that is optimal with respect to M has the following property:
|C(a)| = 1, |C(b)| = 2, |C(c)| = 3, and |C(d)| = 3.
- An example of such a code is: C(a) = 0, C(b) = 11, C(c) = 100, and C(d) = 101.
- Using this code, the message aababacd would be encoded as C(aababacd) = 00110110100101.
- Notice this code has the prefix-free property: no codeword is an initial substring (a prefix) of any other codeword.
- This makes it possible to uniquely decode the strings we output. For example, if we had set C(b) = 10, then it would have been a prefix of C(c) = 100, and it would be impossible to decode the string 100, as it could mean either ba or c.
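- A minimal sketch of encoding and decoding with this example code; the greedy decoder is unambiguous precisely because the code is prefix-free:

```python
# The example code from the slides above.
C = {'a': '0', 'b': '11', 'c': '100', 'd': '101'}

def encode(message: str) -> str:
    return ''.join(C[sym] for sym in message)

def decode(bits: str) -> str:
    # Greedy decoding: grow a buffer until it matches a codeword.
    # With a prefix-free code, a match can never be extended to another codeword.
    inverse = {code: sym for sym, code in C.items()}
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ''
    return ''.join(out)

assert encode('aababacd') == '00110110100101'
assert decode('00110110100101') == 'aababacd'
```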
`gamma`-codes
- As another example of a code, let's briefly get ahead of ourselves and consider `gamma`-codes.
- Given a sequence of offsets in a posting list, e.g. 100000, 100005, 100011, ..., one might have a sequence of quite
large numbers which would take many bits to store.
- However, the gaps between successive numbers (here 5 and 6) are small but always at least 1, so writing the first number and then the gaps can save a lot of space (gap compression).
- To compress these numbers it suffices to be able to encode only numbers greater than or equal to 1.
- To compress these small numbers: we write the length of the number's binary representation minus 1 in unary (as 0s), followed by a 1, followed by the number in binary.
- So initially the number 5 would be encoded as 001 101, and 16 would be encoded as 00001 10000.
- Notice that the high-order bit of the binary part is redundant given that we know the length of the number,
so we can drop this bit to get the actual encoding: for 5, 001 101 becomes 001 01, and for 16 one gets 00001 0000 (see the sketch after this list).
- For the homework, you are asked to show this code is prefix-free.
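- A minimal sketch of gap compression plus `gamma`-coding, following the description above; bit strings are Python `str` values for clarity, and note that the 1 terminating the unary prefix doubles as the dropped high-order bit:

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code for n >= 1, as a bit string."""
    body = bin(n)[2:]                    # binary form, always starts with '1'
    return '0' * (len(body) - 1) + body  # unary length prefix + binary body

def gamma_decode(bits: str, pos: int) -> tuple[int, int]:
    """Decode one codeword starting at bits[pos]; return (value, next pos)."""
    k = 0
    while bits[pos + k] == '0':          # count the unary prefix of zeros
        k += 1
    value = int(bits[pos + k : pos + 2 * k + 1], 2)  # the '1' plus k more bits
    return value, pos + 2 * k + 1

# Store the first offset, then gamma-code the gaps (all >= 1).
postings = [100000, 100005, 100011]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
bits = ''.join(gamma_encode(g) for g in gaps)

# Decoding reverses both steps.
decoded, pos, total = [], 0, 0
while pos < len(bits):
    gap, pos = gamma_decode(bits, pos)
    total += gap
    decoded.append(total)
assert decoded == postings
assert gamma_encode(5) == '00101' and gamma_encode(16) == '000010000'
```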
Quiz
Which of the following is true?
- Document-at-a-time typically requires fewer disk seeks than term-at-a-time query processing.
- The region algebra we defined for GC-lists is simpler to implement than Boolean queries.
- A posting list can be expressed as a generalized concordance list.
More on Prefix Codes
- A prefix code C can be thought of as a binary tree in which each leaf node corresponds to a symbol `sigma`.
- The labels (0 for left, 1 for right) encountered along the path from the root to a leaf spell out the symbol's codeword `C(sigma)`.
- Without prefix freeness, some symbol would be associated with an internal node in this kind of tree.
- Suppose for all of our symbols `M(sigma_i) = 2^(-lambda_i)` for some `lambda_i in NN`.
- Because M is a probability distribution, we have
`sum_(i=1)^(n)M(sigma_i) = sum_(i=1)^(n)2^(-lambda_i) = 1`.
- Let's try to find an optimal code tree for this distribution.
Making an optimal code tree
- Every node in our tree must be either a leaf node (and have a codeword associated with it) or an internal node with
exactly two children; i.e., the tree is a proper binary tree.
- If the tree had an internal node with only one child, we could remove the internal node and shorten its descendant codewords by 1.
- Let `L = {L_1, ..., L_n}` be the set of leaf nodes in a proper binary tree. Then:
`sum_(i=1)^n2^(-d(L_i)) = 1`,
where `d(L_i)` is the depth of node `L_i`.
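- As a check, the tree for the earlier example code (C(a) = 0, C(b) = 11, C(c) = 100, C(d) = 101) has leaves at depths 1, 2, 3, and 3, and indeed `2^(-1) + 2^(-2) + 2^(-3) + 2^(-3) = 1`.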
- Thus, it makes sense to try to assign codewords to symbols in such a way that:
`|C(sigma_i)| = d(L_i) = lambda_i = - log(M(sigma_i))`.
- The resulting tree would represent an optimal code for the probability distribution `M` because the average number of bits per symbol used by `C` if we encode a sequence of symbols generated according to `M` would be:
`sum_(i=1)^(n)Pr[sigma_i]cdot |C(sigma_i)| = -sum_(i=1)^(n) M(sigma_i) cdot log(M(sigma_i))`
because `|C(sigma_i)| = - log(M(sigma_i))`.
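- As a numeric check, a small sketch computing both quantities for the example model and code given earlier:

```python
from math import log2

M = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
C = {'a': '0', 'b': '11', 'c': '100', 'd': '101'}

avg_bits = sum(M[s] * len(C[s]) for s in M)       # average codeword length
entropy  = -sum(p * log2(p) for p in M.values())  # -sum M(s) log M(s)
assert avg_bits == entropy == 1.75                # this code meets the bound exactly
```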
Making an optimal code tree
This would be optimal because of the following theorem due to Shannon (1949):
Source Coding Theorem. Given a symbol source S, emitting symbols from an alphabet A according to a probability
distribution `P_A`, a sequence of symbols cannot be compressed to consume less than
`H(S) = -sum_(sigma in A) P_A(sigma) cdot log(P_A(sigma))`
bits per symbol on average. Here H(S) is called the entropy of the symbol source S.
After the midterm, we will give a particular coding strategy, called Huffman coding, that achieves this bound when every symbol's probability is a power of 2 (and uses less than one extra bit per symbol on average otherwise).