Index Compression




CS267

Chris Pollett

Oct. 24, 2012

Outline

Introduction

General-Purpose Data Compression

Symbolwise Data Compression

Modeling and Coding

Compression Models and Codes

`gamma`-codes

HW Problem

Exercise 5.4. Express the following queries in the region algebra.

(a) Find plays that contain "Birnam" followed by "Dunsinane".

Answer. ("<PLAY>" ... "</PLAY>") |> ("Birnam Dunsinane").
Writing the two words as one phrase means "immediately followed by". If you only want "Dunsinane" to occur somewhere after "Birnam", you could use "..." instead. Note that this returns the whole play.
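
To see why the query in (a) returns the whole play, here is a minimal Python sketch of the containing operator (|>). It assumes extents are represented as (start, end) token positions. The function name, the positions, and the quadratic scan are illustrative assumptions, not the textbook's algorithm.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (start, end) token positions, inclusive (assumed representation)

def containing(a: List[Extent], b: List[Extent]) -> List[Extent]:
    """Return the extents of `a` that contain at least one extent of `b`."""
    return [(s, e) for (s, e) in a
            if any(s <= bs and be <= e for (bs, be) in b)]

# Hypothetical positions: two plays and one occurrence of the phrase.
plays = [(0, 99), (100, 199)]
phrase = [(150, 151)]  # "Birnam Dunsinane" at tokens 150-151 (made up)
result = containing(plays, phrase)
print(result)  # [(100, 199)]
```

The result is an extent taken from the left operand, so the answer is the play itself rather than the matching phrase.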

(b) Find fragments of text that contain "Birnam" and "Dunsinane".

Answer. "Birnam" Δ "Dunsinane". I'll keep it this simple since it is unclear exactly what is meant by "fragments of text".

(c) Find plays in which the word "Birnam" is spoken by a witch.

Answer. ("<PLAY>" ... "</PLAY>") |> ("Birnam" <| ("<LINE>" ... "</LINE>") <|
("<SPEECH>" ... "</SPEECH>") |> ("<SPEAKER>" ... "</SPEAKER>") |> "witch")

(d) Find speeches that contain "toil" or "trouble" in the first line, and do not contain "burn" or "bubble" in the second line.

Answer. This one is somewhat painful, so I am only giving an approximate solution ...
("<SPEECH>" ... "</SPEECH>") |>
(("<LINE>" ... "</LINE>" |> "toil" `grad` "trouble") ... ("<LINE>" ... "</LINE>")) NOT |> "burn" `grad` "bubble".

(e) Find a speech by an apparition that contains "fife" and that appears in a scene along with the line "Something wicked this way comes".

Answer. (("<SPEECH>" ... "</SPEECH>" |> "fife") |> ("<SPEAKER>" ... "</SPEAKER>") |> "apparition") Δ ("<SCENE>" ... "</SCENE>") |> ("<LINE>" ... "</LINE>" |> "Something wicked this way comes")

More on Prefix Codes

Making an optimal code tree

  • This would be optimal because of the following theorem from Shannon (1949):
    Source Coding Theorem. Given a symbol source S, emitting symbols from an alphabet A according to a probability distribution `P_A`, a sequence of symbols cannot be compressed to consume fewer than
    `H(S) = -sum_(sigma in A) P_A(sigma) cdot log_2(P_A(sigma))`
    bits per symbol on average. Here H(S) is called the entropy of the symbol source S.
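
As a quick check of the entropy formula, the following Python sketch computes H(S) for a small distribution. The three-symbol alphabet, its probabilities, and the example prefix code are assumptions chosen for illustration. The logarithm is taken base 2 so that the result is in bits.

```python
import math
from typing import Dict

def entropy(dist: Dict[str, float]) -> float:
    """Entropy in bits per symbol of a probability distribution."""
    # Symbols with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical distribution over a three-symbol alphabet.
p_a = {"a": 0.5, "b": 0.25, "c": 0.25}
print(entropy(p_a))  # 1.5

# A prefix code with codeword lengths 1, 2, 2 (e.g. a -> 0, b -> 10, c -> 11)
# has the same expected length, so it meets the bound for this distribution.
lengths = {"a": 1, "b": 2, "c": 2}
expected_length = sum(p_a[s] * lengths[s] for s in p_a)
print(expected_length)  # 1.5
```

In this example every probability is a power of 1/2, which is why a prefix code can match the entropy exactly. For other distributions the expected codeword length is generally somewhat above H(S).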