Nonparametric and Parametric Codes, Gap-Compression
CS267
Chris Pollett
Oct. 31, 2012
Outline
Finish up General Text Compression
Delta-values
Nonparametric and Parametric codes
HW Problem
Gap Compression
Finishing Up General Text Compression
Up until last class we were looking at general-purpose text-compression algorithms.
Such algorithms might be suitable for compressing the raw text downloaded from a collection
of web sites.
For example, the file formats that the Internet Archive uses for web downloads, ARC and WARC, consist of
a file which contains a sequence of objects, each of which is compressed using gzip.
We have looked at Huffman and arithmetic coding; gzip and bzip2 use more sophisticated generalizations of these
techniques.
The book has a slide comparing 0th-, 1st-, and 2nd-order variants of Huffman and arithmetic coding against each other and against
gzip and bzip2 on Shakespeare, TREC45, and GOV2.
Because Huffman coding has to store its code table, it performed the worst, followed by arithmetic coding, then gzip, then bzip2 (about 2.5 times better than Huffman, achieving between 4.5 and 5 times compression over the original text).
There is a Huffman encoding scheme called LLRUN, which is sometimes used to do posting list compression.
Compressing Posting Lists: `Delta`-values
A posting list consists of a sequence of integers giving the doc ids of the documents that contain a particular word.
So it looks like a sequence of increasing integers, such as: `L = langle 3, 7, 11, 23, 29, 37, 41, ... rangle`
In the worst case you might imagine having to use a whole int to store a particular posting.
On the other hand, if we look at the differences between adjacent pairs, the numbers involved are usually quite small (but all greater than 0):
`Delta(L) = langle 3, 4, 4, 12, 6, 8, 4, ... rangle`.
These are called `Delta`-values. One of the games of index compression is to understand the distribution of `Delta`-values so
that one can find a good compression scheme for them.
It should be noted that if one has a positional index rather than a schema-independent index, it also makes sense to compute `Delta`-values for the positions within a document.
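To make the `Delta`-value idea concrete, here is a minimal Python sketch (the function names are mine, not from the book) of converting a posting list to gaps and back:

```python
def to_deltas(postings):
    """Convert a strictly increasing list of doc ids into Delta-values (gaps)."""
    deltas = []
    prev = 0
    for doc_id in postings:
        deltas.append(doc_id - prev)  # always > 0, since doc ids strictly increase
        prev = doc_id
    return deltas

def from_deltas(deltas):
    """Recover the original posting list by prefix-summing the gaps."""
    postings = []
    total = 0
    for gap in deltas:
        total += gap
        postings.append(total)
    return postings

# to_deltas([3, 7, 11, 23, 29, 37, 41]) == [3, 4, 4, 12, 6, 8, 4]
```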
Two common classes of codes used to do this are nonparametric and parametric codes: the former do not take the distribution of `Delta`-values into account when encoding, while the latter do. Usually, the former are faster and the latter compress better.
Nonparametric Gap Compression
The simplest nonparametric code on the positive integers is the unary code: encode `k` as `k-1` 0's followed by a 1.
This code is optimal if the `Delta`-values follow a geometric distribution of the form:
`Pr[Delta = k] = 2^(-k)`. This is usually not the case.
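As a minimal sketch (names are illustrative), unary encoding and decoding over 0/1 strings:

```python
def unary_encode(k):
    """Encode a positive integer k as (k-1) 0's followed by a 1."""
    return "0" * (k - 1) + "1"

def unary_decode(bits):
    """Count the 0's before the first 1."""
    return bits.index("1") + 1

# unary_encode(4) == "0001" and unary_decode("0001") == 4
```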
Elias (1975) proposed the `gamma` code which we have already mentioned.
To encode `k` we use `lfloor log_2(k) rfloor` 0's followed by `k` written in binary.
So `|gamma(k)| = 2 lfloor log_2(k) rfloor + 1` bits.
The `gamma`-code is good for compressing gaps of size less than 32. It is optimal when the gaps are distributed according
to `1/(2k^2)`.
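A minimal sketch of the `gamma`-code as just described, again over 0/1 strings for readability rather than speed:

```python
def gamma_encode(k):
    """Elias gamma: floor(log2(k)) 0's, then k written in binary (k >= 1)."""
    binary = bin(k)[2:]                      # binary form of k, leading 1 included
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    """The number of leading 0's tells us how long the binary part is."""
    zeros = bits.index("1")
    return int(bits[zeros:2 * zeros + 1], 2)

# gamma_encode(5) == "00101" and gamma_decode("00101") == 5
```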
If we encode `k` by first encoding the length of its binary representation with a `gamma`-code, followed by that binary representation (the leading 1 bit can be dropped, since it is implied), we get a `delta`-code. It performs better for larger gaps and is optimal for distributions of the form `1/(2k(log k)^2)`.
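Reusing the `gamma_encode` sketch from above, a `delta`-code might look like the following (a sketch, not the book's implementation):

```python
def delta_encode(k):
    """Elias delta: gamma-code the length of k's binary form,
    then append that binary form without its leading 1."""
    binary = bin(k)[2:]
    return gamma_encode(len(binary)) + binary[1:]

def delta_decode(bits):
    zeros = bits.index("1")
    length = int(bits[zeros:2 * zeros + 1], 2)       # gamma-decoded length
    rest = bits[2 * zeros + 1:2 * zeros + length]    # remaining length-1 bits
    return int("1" + rest, 2)                        # restore the implied leading 1

# delta_encode(5) == "01101" and delta_decode("01101") == 5
```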
One could imagine encoding `log^star k` using a unary code, followed by binary codes for each of the `m`-fold iterates of `log` applied to `k`,
between `log^star k` and `0`, to get a code for `k`. This is essentially the `omega` code for `k`.
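A minimal sketch of the `omega` idea (each step prepends the binary form of the previous group's length, and a trailing 0 terminates the code):

```python
def omega_encode(k):
    """Elias omega code for k >= 1."""
    code = "0"                      # terminating 0
    while k > 1:
        binary = bin(k)[2:]
        code = binary + code        # prepend k in binary
        k = len(binary) - 1         # next, encode one less than that length
    return code

def omega_decode(bits):
    k, i = 1, 0
    while bits[i] != "0":           # a leading 1 means another group follows
        length = k + 1
        k = int(bits[i:i + length], 2)
        i += length
    return k

# omega_encode(17) == "10100100010" and omega_decode("10100100010") == 17
```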
Parametric Gap Compression
Unlike nonparametric codes, parametric ones take into account the specific characteristics of the list to be compressed.
Parametric codes fall into two categories: global, which choose a single parameter value that is used for all lists in the index; and local, which use a different value for each list.
We will mainly consider local methods on the next few slides...
One example of a global technique is to group gaps into buckets, `B_j = [2^j, 2^(j+1))`.
We use Huffman coding to encode the bucket indices seen and store the encoding table once. An individual gap is then encoded as the Huffman code for its bucket, followed by its offset within the bucket as a `j`-bit binary number.
This scheme is called LLRUN.
It uses a 0th order language model. The book also talks about variants that use higher-order models.
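A sketch of the bucket split that LLRUN is built on (the Huffman stage is omitted here; in practice the bucket indices `j` would be Huffman-coded according to their observed frequencies):

```python
def llrun_split(gap):
    """Split a gap into its bucket index j, where gap lies in B_j = [2^j, 2^(j+1)),
    and a j-bit offset within that bucket."""
    j = gap.bit_length() - 1      # floor(log2(gap))
    offset = gap - (1 << j)       # position within bucket B_j, fits in j bits
    return j, offset              # emit Huffman code for j, then offset in j bits

# llrun_split(1) == (0, 0) and llrun_split(13) == (3, 5)
```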
HW Problem 6.1
Geometric Distributions and Posting Lists
Suppose the `Delta`-values follow a geometric distribution: `Pr[Delta = k] = (1 - p)^(k-1)p`.
This is not unreasonable: suppose our collection has `N` documents and term `T` appears in `N_T` of them. One would expect that the probability `p` that `T` appears in a given document chosen at random is something like `N_T/N`.
So if all documents are independent of each other, the probability of seeing a gap of size `k` between two consecutive occurrences of `T` is:
`Pr[Delta = k] = (1 - N_T/N)^(k-1) cdot (N_T/N)`
The diagram below shows this distribution.
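As a quick sanity check, the distribution is easy to evaluate directly (the parameter value below is just an illustrative assumption):

```python
def gap_probability(k, p):
    """Pr[Delta = k] under a geometric distribution with p = N_T/N."""
    return (1 - p) ** (k - 1) * p

# With p = 0.15: gap_probability(1, 0.15) == 0.15, gap_probability(7, 0.15) ≈ 0.057
```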
Golomb/Rice Codes
Notice that most of the gaps fall around a common length; in the picture on the last slide, this
was between lengths 6 and 8.
Call this common length the modulus.
One way to encode the gaps is to: (1) determine an appropriate modulus, (2) split each `Delta`-value `k` into two components:
a quotient q(k) and a remainder r(k) where:
`q(k) = lfloor (k-1)/M rfloor, r(k) = (k-1) mod M`
Encode `k` by writing `q(k)+1` in unary, followed by `r(k)` as a `lfloor log_2(M) rfloor`-bit or `|~ log_2(M) ~|`-bit number.
For general `M` the above code is called a Golomb code, after S. Golomb (1966). If `M` is a power of `2`, then it
is called a Rice code, after R. Rice (1971), whose paper popularized this kind of coding.
Rice codes tend to be only slightly less efficient in terms of compression, but run faster.
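A minimal sketch of the Rice case (where `M = 2^b`, so the remainder is a fixed `b`-bit number; a general Golomb code would need the floor-or-ceiling remainder trick above):

```python
def rice_encode(k, b):
    """Rice code with modulus M = 2^b: unary quotient, then b-bit remainder."""
    q = (k - 1) >> b                      # q(k) = floor((k-1)/M)
    r = (k - 1) & ((1 << b) - 1)          # r(k) = (k-1) mod M
    return "0" * q + "1" + format(r, "0%db" % b)   # q(k)+1 in unary, then r(k)

def rice_decode(bits, b):
    q = bits.index("1")                   # unary part
    r = int(bits[q + 1:q + 1 + b], 2)     # fixed-width remainder
    return q * (1 << b) + r + 1

# With M = 4 (b = 2): rice_encode(7, 2) == "0110" and rice_decode("0110", 2) == 7
```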
Finding the Modulus
Recall a code `C` is optimal with respect to a given probability distribution `M` if the relationship
`|C(sigma_1)| = |C(sigma_2)|+ 1`
for two symbols `sigma_1` and `sigma_2` implies `M(sigma_1) = 1/2 cdot M(sigma_2)`.
We know that the Golomb codeword for integer `k+M` is `1` bit longer than the codeword for the integer `k`.
Therefore, the optimal parameter value for the modulus (call this modulus `M^star`) is:
`Pr[Delta = k + M^star] = 1/2 Pr[Delta = k] iff (1 - N_T/N)^(k + M^star - 1) = 1/2 (1 - N_T/N)^(k-1)`
iff `M^star = frac(- log(2))(log(1 - N_T/N))`.
This is not usually an integer.
For Rice coding we need to choose between `M = 2^(lfloor log_2 M^star rfloor)` and `M = 2^(|~ log_2 M^star ~|)`.
For Golomb coding we choose between `lfloor M^star rfloor` and `|~ M^star ~|`.
Gallager and van Voorhis (1975) have shown that the optimal choice is:
`M_(opt) = |~ frac(log(2 - N_T/N))(-log(1 - N_T/N)) ~|`.
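In code, computing the optimal modulus is a one-liner (a sketch; `n_t` and `n` stand in for `N_T` and `N`):

```python
import math

def optimal_modulus(n_t, n):
    """Gallager/van Voorhis optimal Golomb modulus for p = N_T/N."""
    p = n_t / n
    return math.ceil(math.log(2 - p) / -math.log(1 - p))

# optimal_modulus(1, 10) == 7, since log(1.9)/-log(0.9) ≈ 6.09
```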
Byte-Aligned Codes
If we want to improve compression and decompression speeds, it makes sense to look at compression methods
that are geared to the way the microprocessors in today's computers work.
Thus, it makes sense to look at codes in which the splits between codewords fall on byte or word boundaries.
One of the simplest of these kinds of codes is vByte. In vByte, we use the top bit of each byte to say whether the number being encoded continues into the next byte (1) or not (0).
We then use the remaining seven bits of each byte as payload, encoding from the low-order bits of the number to the high-order bits.
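A minimal sketch of vByte under exactly the convention above (top bit 1 = another byte follows, 7 payload bits per byte, low-order bits first):

```python
def vbyte_encode(k):
    """vByte-encode a non-negative integer into a byte string."""
    out = bytearray()
    while k >= 128:
        out.append(0x80 | (k & 0x7F))   # 7 payload bits, continuation bit set
        k >>= 7
    out.append(k)                        # last byte: top bit clear
    return bytes(out)

def vbyte_decode(data):
    k, shift = 0, 0
    for byte in data:
        k |= (byte & 0x7F) << shift      # payload arrives low-to-high order
        shift += 7
        if not byte & 0x80:              # top bit clear: this was the last byte
            break
    return k

# vbyte_encode(300) == bytes([0xAC, 0x02]) and vbyte_decode(bytes([0xAC, 0x02])) == 300
```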
You can also have word-aligned codes which work on multiples of 32 or 64 bits.
One example of these is Simple-9, proposed by Anh and Moffat (2005). In this coding scheme, the first 4 bits of a 32-bit word serve as a selector. The remaining 28 bits are split into chunks according to the particular selector.
For example, if the selector is 0000, then twenty-eight 1-bit numbers are encoded in the word; if the selector is
0001, fourteen 2-bit numbers are encoded, and so on.
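A sketch of the Simple-9 selector table and a greedy pack of one 32-bit word (the table is the standard one; the helper's name and greedy strategy are illustrative):

```python
# (how many numbers fit in one word, bits per number), one row per selector
SIMPLE9_SELECTORS = [
    (28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28),
]

def pack_simple9_word(gaps):
    """Pack a prefix of `gaps` into one 32-bit word; return (word, count used)."""
    for selector, (count, width) in enumerate(SIMPLE9_SELECTORS):
        chunk = gaps[:count]
        if len(chunk) == count and all(g < (1 << width) for g in chunk):
            word = selector << 28              # top 4 bits hold the selector
            for i, g in enumerate(chunk):
                word |= g << (i * width)       # each gap gets its own slot
            return word, count
    raise ValueError("gap does not fit in 28 bits")
```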