Index Compression




CS267

Chris Pollett

Oct. 31, 2011

Outline

Introduction

General-Purpose Data Compression

Symbolwise Data Compression

Modeling and Coding

Compression Models and Codes

`gamma`-codes

Quiz

Which of the following is true?

  1. Document-at-a-time typically requires fewer disk seeks than term-at-a-time query processing.
  2. The region algebra we defined for GC-lists is simpler to implement than Boolean queries.
  3. A posting list can be expressed as a generalized concordance list.

More on Prefix Codes

Making an optimal code tree

  • This would be optimal because of the following theorem from Shannon (1949)
    Source Coding Theorem. Given a symbol source S, emitting symbols from an alphabet A according to a probability distribution `P_A`, a sequence of symbols cannot be compressed to consume fewer than
    `H(S) = -sum_(sigma in A) P_A(sigma) cdot log_2(P_A(sigma))`
    bits per symbol on average. Here H(S) is called the entropy of the symbol source S.
  • After the midterm, we will present a particular coding strategy, Huffman coding, that is optimal among prefix codes and comes within one bit per symbol of this bound.
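
To make the entropy bound concrete, here is a minimal Python sketch that evaluates H(S) for a small distribution and compares it with the average codeword length of a prefix code. The helper names `entropy` and `expected_code_length`, the three-symbol alphabet, and the example code are illustrative assumptions and do not come from the lecture. Logarithms are taken base 2 so that the result is in bits.

```python
import math


def entropy(probabilities):
    """Entropy in bits per symbol of a distribution given as {symbol: probability}."""
    # Symbols with probability 0 contribute nothing to the sum and are skipped.
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)


def expected_code_length(probabilities, code):
    """Average number of bits per symbol when each symbol s is sent as code[s]."""
    return sum(p * len(code[s]) for s, p in probabilities.items())


# Hypothetical source: three symbols whose probabilities are powers of 1/2.
P_A = {"a": 0.5, "b": 0.25, "c": 0.25}

# A prefix code for this alphabet: no codeword is a prefix of another.
code = {"a": "0", "b": "10", "c": "11"}

print(entropy(P_A))                     # 1.5
print(expected_code_length(P_A, code))  # 1.5
```

Because every probability in this example is a power of 1/2, the prefix code meets the bound exactly. For other distributions, a code that assigns a whole number of bits to each symbol will generally average somewhat more than H(S).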