Outline
- Homomorphism and Quotients
- Algorithms for Regular Languages
- Pumping Lemma
Introduction to Homomorphisms
- From time to time it is useful to be able to translate things from one language to another.
- In the simplest case this might involve transliterating character by character, as when we convert Russian spellings of proper names into English. For example, Сталин -> STALIN.
- As another example, we might want to encode our alphabet using an error correcting code.
- In converting from one such language to another we have to be sensitive to the fact that sometimes an exactly equivalent string might not exist. For example, English has the words blue and turquoise, which in some other language might both be translated as blue.
- Homomorphisms allow us to do these kinds of mappings for regular languages.
Definition of Homomorphism
- Given two alphabets `Sigma` and `Sigma'`, a function `h:Sigma ->(Sigma')^star` is called a homomorphism. The domain of `h` can be extended to all strings in `Sigma^star` as follows: if `w=a_1a_2 ldots a_n` then `h(w) = h(a_1)h(a_2) ldots h(a_n)`.
- Given a language `L`, its homomorphic image is defined to be `h(L) = {h(w)| w in L}`.
- For example, strings over `{a,b}` might be encoded as strings over `{0,1}` via the homomorphism: `h(a) = 000`, `h(b) = 111`.
- In this case the language `L={aa, baba}` has as its homomorphic image: `h(L) = {000000, 111000111000}`.
- A homomorphism does not have to be one-to-one. One could map `{a, b}` to the alphabet `{a}` via `h(a)=a`, `h(b)=a`, in which case `h(ababa) = aaaaa`.
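The extension of `h` from symbols to strings and languages can be sketched in a few lines of Python, using the example mapping above (the function names and the dict representation of `h` are just illustrative choices):

```python
def apply_homomorphism(h, w):
    """Extend a symbol map h to a whole string: h(a_1...a_n) = h(a_1)...h(a_n)."""
    return "".join(h[c] for c in w)

def homomorphic_image(h, language):
    """Image of a (finite) language under h: { h(w) : w in L }."""
    return {apply_homomorphism(h, w) for w in language}

h = {"a": "000", "b": "111"}
print(apply_homomorphism(h, "aa"))              # 000000
print(homomorphic_image(h, {"aa", "baba"}))     # {'000000', '111000111000'}
```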
Closure under Homomorphism
Theorem. Let `L` be a regular language over `Sigma` and let `h:Sigma ->(Sigma')^star` be a homomorphism. Then `h(L)` is a regular language over `Sigma'`.
Proof. We have shown that every regular language can be represented by a regular expression. Let `R` be a regular expression for `L`. We prove by induction on the complexity of `R` that `h(L)` is regular. In the base case, `R` is either a symbol `a` of `Sigma`, the empty string, or the empty set. In the latter two cases `h(L(R)) = L(R)`, so we are done. In the first case, we note that `h(a)` is a string over `Sigma'` and so is itself a regular expression over the alphabet `Sigma'`. For the induction step, `R` is of the form `R = (R_1R_2)`, `R = (R_1 cup R_2)`, or `R = (R_1)^star`. In each of these cases, we have by the induction hypothesis regular expressions `R_1'` and `R_2'` for the homomorphic images of the languages of the subexpressions. So to make a regular expression for the homomorphic image of the language of `R` we can take, respectively, `R' = (R_1'R_2')`, `R' = (R_1' cup R_2')`, or `R' = (R_1')^star`.
A generalization of closure under homomorphism, closure under substitutions, was first proven in Bar-Hillel, Perles, and Shamir 1961.
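The induction in the proof can be carried out mechanically. The sketch below assumes a small tuple-based representation of regular expressions (an illustration, not a standard library): a regex is either a literal string of symbols, `("union", r1, r2)`, `("concat", r1, r2)`, `("star", r)`, or the empty set.

```python
EMPTY_SET = ("empty",)

def h_regex(h, r):
    """Rewrite a regular expression for L into one for h(L), by induction."""
    if r == EMPTY_SET or r == "":
        return r                        # h fixes the empty set / empty string
    if isinstance(r, str):              # base case: a symbol a of Sigma
        return h[r]                     # h(a) is a string, hence a regex
    op = r[0]
    if op in ("union", "concat"):
        return (op, h_regex(h, r[1]), h_regex(h, r[2]))
    if op == "star":
        return ("star", h_regex(h, r[1]))

h = {"a": "000", "b": "111"}
r = ("concat", "a", ("star", "b"))      # the regex a b*
print(h_regex(h, r))                    # ('concat', '000', ('star', '111'))
```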
Quotients
- A common problem in the computer processing of natural languages is to come up with the stems of a given sequence of words.
- For example, a search engine query for fished, fishing, fishes, etc. might, as a preprocessing step, be stemmed to just the word 'fish'.
- We will next consider a notion of the quotient of two languages which allows us to formally consider things like stemming.
Definition. If `A` and `B` are two languages, their quotient `A/B` is the language: `{v | vw in A \ mbox(and) \ w in B}`.
- So if `A={mbox(fished, fish, fishes, fishing, jumping, oranges)}` and `B={mbox(ing, ed)}` then `A/B = {mbox(fish, jump)}`.
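For finite languages the quotient can be computed directly from the definition; a minimal sketch (the function name is just an illustrative choice):

```python
def quotient(A, B):
    """A/B = { v : vw in A for some w in B }, for finite A and B."""
    return {x[: len(x) - len(w)]
            for x in A for w in B
            if x.endswith(w)}

A = {"fished", "fish", "fishes", "fishing", "jumping", "oranges"}
B = {"ing", "ed"}
print(quotient(A, B))   # {'fish', 'jump'}
```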
Closure under Quotients
Theorem. If `A` and `B` are regular languages, then `A/B` is also a regular language.
Proof. Let `M=(Q, Sigma, delta, q_0, F)` be a DFA for `A` and let `M'` be a DFA for `B`. A string `v` is in `A/B = (L(M))/(L(M'))` iff `delta^star(q_0,v)=q_i` for some `i` such that `delta^star(q_i,w) in F` for some `w in L(M')`. Let `M_i = (Q, Sigma, delta, q_i, F)`; then `L(M_i) cap L(M')` is regular by HW2 Prob5. Notice `L(M_i) cap L(M')` is nonempty iff the two conditions `delta^star(q_i, w) in F` and `w in L(M')` hold for some `w`. Further, we can check whether `L(M_i) cap L(M')` is nonempty by checking whether some accepting state of a machine for this language is reachable from its start state. Hence, we can make a machine for `A/B` as `(Q, Sigma, delta, q_0, F')`, where `F'` is the set of states `q_i` in `Q` such that `L(M_i) cap L(M')` is nonempty.
This result is due to Ginsburg and Spanier 1963.
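The construction in the proof can be sketched concretely. Here a DFA is modeled (an assumed representation, not from the text) as a tuple `(states, delta, start, accept)` with `delta` a dict keyed by `(state, symbol)`; for each state `q` of `M` we test whether `L(M_q) cap L(M')` is nonempty by a reachability search in the product automaton, collecting the new accept set `F'`.

```python
from collections import deque

def product_nonempty(M, q, Mp):
    """Is L(M restarted at q) cap L(M') nonempty? BFS over the product."""
    _, delta, _, accept = M
    _, deltap, startp, acceptp = Mp
    seen = {(q, startp)}
    frontier = deque(seen)
    symbols = {s for (_, s) in delta}
    while frontier:
        u, v = frontier.popleft()
        if u in accept and v in acceptp:
            return True                 # some w is accepted by both machines
        for s in symbols:
            nxt = (delta[(u, s)], deltap[(v, s)])
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def quotient_dfa(M, Mp):
    """DFA for A/B: same as M except for the accept set F'."""
    states, delta, start, _ = M
    Fp = {q for q in states if product_nonempty(M, q, Mp)}
    return (states, delta, start, Fp)

# Example: A = L(a* b) and B = {b}, so A/B = L(a*).  State 2 is a dead state.
MA = ({0, 1, 2},
      {(0, "a"): 0, (0, "b"): 1, (1, "a"): 2, (1, "b"): 2,
       (2, "a"): 2, (2, "b"): 2}, 0, {1})
MB = ({0, 1, 2},
      {(0, "a"): 2, (0, "b"): 1, (1, "a"): 2, (1, "b"): 2,
       (2, "a"): 2, (2, "b"): 2}, 0, {1})
print(quotient_dfa(MA, MB)[3])   # {0}: the quotient DFA accepts a*
```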
Membership, Emptiness and Finiteness Checking
Theorem. Given a regular language `L` in standard representation and a
string `w`, there are algorithms which can check: (a) if `w` is in `L`, (b) if `L` is empty, and (c) if `L` is finite.
Proof. The first step of each algorithm is to obtain a DFA for `L`. For (a) we can just use the Java program (modified
with the correct transition table) we wrote a few
lectures back to simulate a DFA on an input string. For (b), we view the transition function of the finite automaton for `L`
as specifying a labeled graph and we use the algorithm we gave earlier for reachability in a graph to check if an accept state is
reachable from the start state. If it is, following the edge labels would give an element in `L` and so would mean `L` was non-empty.
If no accept state is reachable, the language is empty. For (c), let `M = (Q, Sigma, delta, s, F)` be a DFA for `L`.
An algorithm to check whether `L` is finite can cycle through each state `q` of `Q` and check if `q` is reachable from `s`. If it is,
it then checks whether `q` is reachable from itself by a nonempty path. If it is, it finally checks whether any accepting state is reachable from `q`. If some
`q` meets all three of these conditions, then by traversing the `q`-to-`q` cycle different numbers of times we can produce arbitrarily many
strings in the language; since `M` is a DFA, traversing the cycle must consume at least one alphabet symbol, so the language is not finite. If no such `q` exists, then no accept state is reachable
from any state on a cycle, so the language must be finite.
The above Theorem is due to Moore (1956).
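The emptiness and finiteness checks can be sketched as graph searches on the transition function. As before, the DFA representation `(states, delta, start, accept)` with `delta` a dict keyed by `(state, symbol)` is an assumed encoding for illustration.

```python
def reachable(delta, sources):
    """All states reachable from the given states by following transition edges."""
    seen = set(sources)
    frontier = list(sources)
    while frontier:
        u = frontier.pop()
        for (q, _), r in delta.items():
            if q == u and r not in seen:
                seen.add(r)
                frontier.append(r)
    return seen

def is_empty(M):
    """(b): empty iff no accept state is reachable from the start state."""
    states, delta, start, accept = M
    return not (reachable(delta, {start}) & accept)

def is_finite(M):
    """(c): infinite iff some reachable state lies on a cycle from which
    an accept state is reachable."""
    states, delta, start, accept = M
    for q in reachable(delta, {start}):
        succ = {r for (u, _), r in delta.items() if u == q}
        if q in reachable(delta, succ) and reachable(delta, {q}) & accept:
            return False   # pump the q-to-q cycle: infinitely many strings
    return True

# M_inf accepts a* b (infinite); M_fin accepts only the string b (finite).
M_inf = ({0, 1, 2},
         {(0, "a"): 0, (0, "b"): 1, (1, "a"): 2, (1, "b"): 2,
          (2, "a"): 2, (2, "b"): 2}, 0, {1})
M_fin = ({0, 1, 2},
         {(0, "a"): 2, (0, "b"): 1, (1, "a"): 2, (1, "b"): 2,
          (2, "a"): 2, (2, "b"): 2}, 0, {1})
print(is_empty(M_inf), is_finite(M_inf), is_finite(M_fin))  # False False True
```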
Remark. Suppose our regular language is presented as a regular expression `R`. The way we have been suggesting to check whether `w` is in `L`
is to convert `R` to an NFA `N`, do the powerset construction to get a DFA `D`, minimize the DFA, and run the algorithm on the DFA. If `N` has `m` states then `D` might have as many as `O(2^m)` states, which might mean a lot of space is used even after minimization. For this reason, people often simulate `w` directly on the NFA, keeping track of the set of states the machine might be in at any given step. This reduces the space requirement to `O(m)` but raises the runtime to as much as `O(m^2 cdot |w|)`, since each input symbol may require following the transitions of up to `m` current states.
The Pumping Lemma
- Suppose we have a machine `M` with `k` states. Feed in some input string `w` of length `n>k`. At some point in the computation, by the pigeonhole principle, the machine must repeat a state.
- Suppose `M` accepts `w`. Then we can imagine `M`'s
computation splitting `w` into 3 pieces, `w=xyz`,
according to the diagram:
More on the Pumping Lemma
- But this implies that `M` accepts the strings `xz`, `xyyz`, `xyyyz`, etc.
- We have thus established the Pumping Lemma:
Lemma (Pumping Lemma). If `A` is a regular language, then
there is a number `p` (the pumping length) such that, if `s` is any string in `A` of length at least `p`,
then `s` may be divided into three pieces `s=xyz` satisfying: for each `i geq 0`, `xy^iz` is in `A`; `|y| > 0`; and
`|xy| leq p`.
- The pumping lemma first appeared in Bar-Hillel, Perles, and Shamir. "On formal properties of simple phrase structure grammars," Z. Phonetik. Sprachwiss. Kommunikationsforsch. Vol 14. (1961). pp.143--172.
Using the Pumping Lemma
We can use the pumping lemma to show languages are not regular.
For example, let `C={ w| w \ mbox(has an equal number of 0's and 1's)}`. To prove `C` is not regular:
- Suppose there is a DFA `M` that recognizes `C`, and let `p` be its pumping length.
- Consider the string `w = 0^p1^p`. This string is in the language and has length greater than `p`.
- So by the pumping lemma `w = xyz`, where `|xy| leq p`, `|y| > 0`, and where `xy^iz`
is in the language for all `i geq 0`. That means `x = 0^k` and `y=0^j` where `k+j leq p` and `j>0`. But then taking `i=0`,
`xz = 0^(p-j)1^p` should be in `C`. As `p-j` is not equal to `p`, this gives a contradiction. So `C` is not regular.
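The argument can be illustrated by brute force, with a small concrete value standing in for the (unknown) pumping length `p`: for `w = 0^p1^p`, every split `w = xyz` with `|xy| leq p` and `|y| > 0` fails already at `i = 0`.

```python
def in_C(s):
    """Membership in C: equal numbers of 0's and 1's."""
    return s.count("0") == s.count("1")

p = 7                                  # stands in for the pumping length
w = "0" * p + "1" * p
assert in_C(w) and len(w) >= p

for mid in range(1, p + 1):            # |xy| = mid <= p, so y is all 0's
    for cut in range(mid):             # x = w[:cut], y = w[cut:mid], |y| > 0
        x, y, z = w[:cut], w[cut:mid], w[mid:]
        assert not in_C(x + z)         # pumping down removes only 0's
print("no valid split of", w, "can be pumped")
```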
More Examples
- Show `L = {ww^R | w in Sigma^star}` is not regular.
- Suppose `M` is a DFA that recognizes `L`.
- Let `p` be `M`'s pumping length
- Consider the string `w = 0^p110^p`. This string is in the language and has length greater than `p`.
- So by the pumping lemma `w = xyz`, where `|xy| leq p`, `|y|> 0`, and where `xy^iz`
is in the language for all `i geq 0`. That means `x = 0^k` and `y=0^j` where `k+j leq p` and `j>0`.
But then taking `i=0`, `xz = 0^(p-j)110^p` should be in `L`. But every string of the form `ww^R` is a palindrome, while `xz` has `p-j` 0's before its first `1` and `p` 0's after its last `1`; since `j>0` these counts differ, so `xz` is not a palindrome and hence not in `L`, contradicting the pumping lemma. So `L` is not regular.
- Show that `L = {w in Sigma^star | \ n_a(w) < n_b(w) }` is not regular. Here `n_x(w)` denotes the number of occurrences of alphabet symbol `x` in `w`.
- Suppose `M` is a DFA that recognizes `L`.
- Let `p` be `M`'s pumping length
- Consider the string `w = a^pb^(p+1)`. This string is in the language and has length greater than `p`.
- So by the pumping lemma `w = xyz`, where `|xy| leq p`, `|y| > 0`, and where `xy^iz`
is in the language for all `i geq 0`. That means `x = a^k` and `y=a^j` where `k+j leq p` and `j>0`.
But then taking `i=2`, `xy^2z = a^(p+j)b^(p+1)` should be in `L`. As `j>0`, `n_a(xy^2z) = p+j` is not
less than `n_b(xy^2z )= p+1`. So `xy^2z` is not in `L`, contradicting the pumping lemma. So `L` is not regular.
Context Free Languages
- We saw that regular languages were useful for doing things like string matching.
- This might occur in practice as the so-called lexical analysis phase of a compiler. That is, the phase in which we recognize tokens like language reserved words, variable names, constants, etc.
- We now turn to ways of specifying programming languages or even aspects of natural languages.
- The key to this is to have some way to recognize the underlying structures such as nouns and verbs, or control blocks, etc of the language.
- Context Free Grammars (CFGs), a less restricted form of grammar than regular grammars, and their languages will provide us with the tools to do this.
Example CFG
- Recall a grammar consists of a collection of substitution rules (aka
productions). For instance: `A -> 0A1`, `A -> B`, `B -> #`
- Usually, we'll write variables using uppercase letters or in brackets like `<mbox(variable)>`. Terminals are strings over the alphabet of the language we are considering.
- In a CFG, the left hand side of each rule has one variable; the right hand side can be a string of variables and terminals.
- Variables can be substituted for; terminals cannot. One variable, usually denoted `S`, is distinguished as the start variable.
- An example sequence of substitutions (aka a derivation) in the above grammar might be: `A => 0A1 => 00A11 => 00B11 => 00#11`
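The derivation above can be carried out mechanically by repeatedly replacing the leftmost variable (the dict encoding of the rules and the helper name are illustrative choices):

```python
rules = {"A": ["0A1", "B"], "B": ["#"]}

def step(sentential, var, rhs):
    """Replace the leftmost occurrence of var by rhs."""
    return sentential.replace(var, rhs, 1)

s = "A"
derivation = [s]
for var, rhs in [("A", "0A1"), ("A", "0A1"), ("A", "B"), ("B", "#")]:
    assert rhs in rules[var]          # only apply rules of the grammar
    s = step(s, var, rhs)
    derivation.append(s)
print(" => ".join(derivation))        # A => 0A1 => 00A11 => 00B11 => 00#11
```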
More on CFGs
- Such a derivation might also be drawn as a parse tree:
- Here the derivation `A => 0A1` gives the tree portion:
- `A => 0A1 => 00A11` gives