Outline
- Brute Force Parsing
- s-grammars
- Methods for transforming grammars
- Chomsky Normal Form
Brute Force Parsing
- One way to do parsing is by exhaustive search.
- We consider each one step derivation from the start variable, then each two step derivation, etc. in turn.
- If we ever see the string we want we accept.
- If all the active derivations involve strings of terminals and variables longer than the string `w` we are searching for, we halt and reject. (This test is sound when the grammar has no `epsilon`-rules, since then the strings in a derivation never get shorter.)
- To handle rules like `A -> B`, which can give derivations like `A => B => A`, we maintain a list of strings we have already seen. If we repeat a string, we prune that branch.
- This is an exponential time algorithm. If we use a normal form for our grammars, we can speed parsing up to cubic time or, in some cases, linear time.
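The search described above can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: variables are single uppercase letters, terminals are single lowercase letters, the grammar has no `epsilon`-rules (so the length test is sound), and only leftmost derivations are followed, which is enough to reach every string in the language. The name `brute_force_parse` and the dictionary layout of `rules` are illustrative choices.

```python
from collections import deque


def brute_force_parse(rules, start, w):
    """Breadth-first search over leftmost derivations.

    rules maps each variable to a list of right-hand sides (strings).
    Assumes the grammar has no epsilon-rules.
    """
    seen = {start}              # sentential forms we have already generated
    frontier = deque([start])   # forms still to be expanded, shortest derivations first
    while frontier:
        form = frontier.popleft()
        if form == w:
            return True         # we have seen the string we want: accept
        # Index of the leftmost variable, or None if the form is all terminals.
        idx = next((i for i, c in enumerate(form) if c.isupper()), None)
        if idx is None:
            continue            # a terminal string other than w: dead end
        for rhs in rules.get(form[idx], []):
            new_form = form[:idx] + rhs + form[idx + 1:]
            if len(new_form) > len(w):
                continue        # longer than w: prune
            if new_form in seen:
                continue        # repeated string (e.g. A => B => A): prune
            seen.add(new_form)
            frontier.append(new_form)
    return False                # no active derivations remain: reject
```

For example, with `rules = {"S": ["aSb", "ab"]}` the call `brute_force_parse(rules, "S", "aabb")` returns `True` and `brute_force_parse(rules, "S", "aab")` returns `False`.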
s-grammars
- We would like parsing algorithms which run in linear time.
- One way to achieve this is to restrict the kind of grammars we consider:
Definition. A context-free grammar `G=(V, T, P, S)` is said to be a simple grammar or s-grammar if all its productions are of the form:
`A -> ax` where `A in V`, `a in T`, `x in V^star`, and any pair `(A,a)` occurs at most once in `P`.
- For example, `S ->aS|bSS|c` is an s-grammar, but `S ->aS|bSS|aSS|c` is not because `(S,a)` occurs in `S ->aS` and `S ->aSS`.
- Our brute force parsing algorithm will run in linear time with an s-grammar, since at any given step the leftmost variable and the next input character determine at most one production which can be used. Further, since the right hand side of a production always starts with a terminal, we match at least one character of the input with each substitution.
- So after at most linearly many substitutions we know if the string is in the language.
- s-grammars tend to be too restrictive to specify practical programming languages; nevertheless, they show that the form of the rules is important for getting efficient parsers.
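The linear-time argument above can be illustrated with a short Python sketch. It is only a sketch under assumptions: variables are single uppercase letters, terminals are single lowercase letters, and the helper names `build_table` and `s_parse` are made up for this example. Because each pair `(A,a)` occurs at most once, the productions fit in a lookup table, and each input character triggers exactly one substitution.

```python
def build_table(productions):
    """Map each pair (A, a) to x for productions A -> ax.

    productions is a list of (head, rhs) pairs. Raises ValueError if the
    grammar is not an s-grammar.
    """
    table = {}
    for head, rhs in productions:
        well_formed = (
            len(rhs) >= 1
            and rhs[0].islower()                      # starts with a terminal
            and all(c.isupper() for c in rhs[1:])     # followed by variables only
        )
        if not well_formed:
            raise ValueError(f"{head} -> {rhs} is not of the form A -> ax")
        key = (head, rhs[0])
        if key in table:
            raise ValueError(f"the pair {key} occurs more than once")
        table[key] = rhs[1:]
    return table


def s_parse(table, start, w):
    """Decide membership of w with one table lookup per input character."""
    stack = [start]                  # variables still to expand; leftmost is on top
    for ch in w:
        if not stack:
            return False             # input remains but nothing is left to expand
        var = stack.pop()
        body = table.get((var, ch))
        if body is None:
            return False             # no production for this (variable, terminal) pair
        stack.extend(reversed(body)) # push x so that its first variable is on top
    return not stack                 # accept iff every variable has been expanded
```

With `table = build_table([("S", "aS"), ("S", "bSS"), ("S", "c")])`, the call `s_parse(table, "S", "abcc")` returns `True`, while `s_parse(table, "S", "ab")` returns `False`. Adding `("S", "aSS")` to the list makes `build_table` raise `ValueError`, matching the non-example in the text.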
Methods for Transforming Grammars
- We are now going to work towards some normal forms which will be useful in obtaining parsing algorithms for general CFGs.
- To do this we will look at different ways to simplify our grammars.
- To start, it is sometimes useful to get rid of the empty string from our language in order to make our proofs easier. It turns out this won't cause a loss of generality in the statements we can make about CFGs.
- To see this, suppose `L` is a language containing `epsilon` and let `L'= L -{epsilon}`.
- If `G'=(V,T,P,S)` is a CFG for `L'`, then `G = (V cup {S_0}, T, P cup {S_0 ->S | epsilon }, S_0)`, where `S_0` is a new variable not in `V`, will be a CFG for `L`.
- So we will for now restrict our attention to grammars without `epsilon`.
More Methods of Transforming Grammars
- Suppose we have a CFG `G=(V,T,P, S)` and let `A -> x_1Bx_2` be in `P`. Suppose the variable `B` occurs in the following productions in `G`: `B ->y_1| y_2| ldots |y_n`. Then if `G'` is the CFG obtained by replacing `A -> x_1Bx_2` by
`A -> x_1y_1x_2 | x_1y_2x_2 | ldots | x_1y_nx_2`, we will have `L(G')=L(G)`.
- Another technique for simplifying CFGs is just to get rid of useless rules:
Definition. Let `G=(V,T,P, S)` be a CFG. A variable `A` in `V` is said to be useful iff there is at least one `w in L(G)` such that `S =>^star xAy =>^star w` for some `x, y in (V cup T)^star`. Otherwise, `A` is said to be useless. A production is useless if it involves any useless variables.
- For example, consider the grammar with rules `S ->A`, `A -> aA|epsilon`, `B ->bA`. Then `B` is useless as it is not reachable from the start variable. So the production `B -> bA` is useless.
- Given a CFG, if we eliminate all its useless productions we get a smaller CFG with the same language.
- To determine the useful variables and productions we can start with `V_1 = emptyset`; this set will collect the variables that can derive a string of terminals. Then repeat the following until there are no more variables added to `V_1`: For each production `A -> x_1 ldots x_n` in which every `x_i` is either a terminal or a variable already in `V_1`, add `A` to `V_1`. If the start variable is not in `V_1` then we know the language is empty, so we can delete all productions.
- Otherwise, if `S` is in `V_1`, it still might not be the case that every variable in `V_1` is useful, since some may not be reachable from `S`, so we set `V_2= {S}`. Then repeat the following until there are no more variables added to `V_2`: For each production `A ->x_1 ldots x_n` with `A in V_2` and with every `x_i` that is a variable in `V_1`, add each variable on the right hand side to `V_2`. After this procedure terminates, take `V_2` to be the set of useful variables. All other variables and the productions they are involved in are useless.
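The two transformations in this section, substituting for a variable and removing useless productions, can be sketched in Python as follows. This is an illustrative sketch only: it assumes variables are single uppercase letters, terminals are single lowercase letters, `epsilon` is written as the empty string, and a grammar is a dictionary from variables to lists of right-hand sides. The function names `substitute` and `remove_useless` are hypothetical.

```python
def substitute(rules, head, rhs, pos):
    """Replace head -> rhs, where rhs[pos] is a variable B, by head -> x_1 y_i x_2
    for every production B -> y_i. Returns a new grammar; rules is not modified."""
    b = rhs[pos]
    new_rules = {v: list(bodies) for v, bodies in rules.items()}
    new_rules[head].remove(rhs)
    for y in rules[b]:
        candidate = rhs[:pos] + y + rhs[pos + 1:]
        if candidate not in new_rules[head]:
            new_rules[head].append(candidate)
    return new_rules


def remove_useless(rules, start):
    """Compute V_1 and V_2 as in the text and keep only the useful productions."""
    def body_ok(body, v1):
        # Every symbol is a terminal or a variable already in V_1.
        return all((not s.isupper()) or s in v1 for s in body)

    # V_1: variables that can derive a string of terminals.
    v1 = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in rules.items():
            if head not in v1 and any(body_ok(b, v1) for b in bodies):
                v1.add(head)
                changed = True
    if start not in v1:
        return {}                       # the language is empty: delete everything

    # Drop productions that mention a variable outside V_1.
    pruned = {h: [b for b in bodies if body_ok(b, v1)]
              for h, bodies in rules.items() if h in v1}

    # V_2: variables reachable from the start variable using the remaining productions.
    v2 = {start}
    work = [start]
    while work:
        head = work.pop()
        for body in pruned[head]:
            for s in body:
                if s.isupper() and s not in v2:
                    v2.add(s)
                    work.append(s)
    return {h: bodies for h, bodies in pruned.items() if h in v2}
```

On the example above, `remove_useless({"S": ["A"], "A": ["aA", ""], "B": ["bA"]}, "S")` returns `{"S": ["A"], "A": ["aA", ""]}`, so the production `B -> bA` is dropped. For substitution, `substitute({"S": ["aBc"], "B": ["b", "bB"]}, "S", "aBc", 1)` returns `{"S": ["abc", "abBc"], "B": ["b", "bB"]}`.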
Removing `epsilon`-rules/productions
- A production (rule) of a CFG of the form `A -> epsilon` is called an `epsilon`-production or `epsilon`-rule. Any variable `A` for which `A=>^star epsilon` is called nullable.
- Even though a CFG might generate a language not containing `epsilon`, it still might have `epsilon`-productions and nullable variables. In that case, these productions can be removed.
- For example, in `S -> aCb`, `C -> aCb| epsilon`, the variable C is nullable. We can eliminate the `epsilon`-rule by doing substitutions to get: `S -> aCb|ab`, `C ->aCb| ab`.
- To find the set `N` of nullable variables of a CFG, we can first put all variables `A` which occur in productions of the form `A -> epsilon` into `N`. Then repeat the following step until no new variables are added: if `B` occurs in a production `B ->A_1A_2 ldots A_n` where each `A_i` is in `N`, then add `B` to `N`.
- Once we have the set of nullable variables, we can eliminate any `epsilon`-rules from our grammar, and for each rule `C ->C_1C_2 ldots C_n` in which nullable variables occur we add a rule for each possible substitution of nullable variables by `epsilon` (except that we do not add `C -> epsilon` itself).
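The nullable-variable computation and the substitution step can be sketched in Python. As before, this is a hedged sketch: variables are assumed to be single uppercase letters, terminals single lowercase letters, `epsilon` is the empty string, and the names `nullable_variables` and `remove_epsilon_rules` are invented for illustration. It also assumes the language of the grammar does not contain `epsilon`.

```python
from itertools import product


def nullable_variables(rules):
    """Return the set N of variables A with A =>* epsilon."""
    n = {head for head, bodies in rules.items() if "" in bodies}
    changed = True
    while changed:
        changed = False
        for head, bodies in rules.items():
            if head in n:
                continue
            # B -> A_1 ... A_n with every A_i already in N makes B nullable.
            if any(body and all(s in n for s in body) for body in bodies):
                n.add(head)
                changed = True
    return n


def remove_epsilon_rules(rules):
    """Drop epsilon-rules and add every variant with nullable variables erased."""
    n = nullable_variables(rules)
    new_rules = {}
    for head, bodies in rules.items():
        out = []
        for body in bodies:
            # A nullable symbol may be kept or replaced by epsilon; others are kept.
            options = [(s, "") if s in n else (s,) for s in body]
            for choice in product(*options):
                candidate = "".join(choice)
                if candidate and candidate not in out:   # never add head -> epsilon
                    out.append(candidate)
        new_rules[head] = out
    return new_rules
```

Applied to the example `{"S": ["aCb"], "C": ["aCb", ""]}`, `nullable_variables` returns `{"C"}` and `remove_epsilon_rules` returns `{"S": ["aCb", "ab"], "C": ["aCb", "ab"]}`, which agrees with the grammar obtained by hand above.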
Eliminate Unit Productions
- A unit production is a production of the form `A -> B`.
- In general, using the reachability algorithm we can determine if `A=>^star C` for any two variables `A` and `C`.
- If `C` occurs in the rules `C -> y_1|y_2| ldots |y_n`, then we can add the rules `A -> y_1|y_2| ldots |y_n` to our grammar without affecting the strings it generates. If we do this for every variable `A` and each `C` for which `A=>^star C`, adding only those `y_i` which are not single variables, then we can delete the unit rules and so eliminate them from the grammar.
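The procedure above can be sketched in Python under the same illustrative conventions as before (single uppercase letters for variables, lowercase letters for terminals, a dictionary from variables to lists of right-hand sides). The sketch assumes `epsilon`-rules have already been removed, so that `A =>^star C` between variables can only happen through chains of unit rules. The name `remove_unit_rules` is hypothetical.

```python
def remove_unit_rules(rules):
    """Return an equivalent grammar with no productions of the form A -> B."""
    def is_unit(body):
        return len(body) == 1 and body.isupper()

    new_rules = {}
    for a in rules:
        # Collect every C with A =>* C via unit rules (A itself included).
        # The list grows while we iterate, so each variable is visited once.
        reach = [a]
        for v in reach:
            for body in rules.get(v, []):
                if is_unit(body) and body not in reach:
                    reach.append(body)
        # A inherits every non-unit right-hand side of every reachable C.
        out = []
        for c in reach:
            for body in rules.get(c, []):
                if not is_unit(body) and body not in out:
                    out.append(body)
        new_rules[a] = out
    return new_rules
```

For instance, for the grammar `S -> Aa | B`, `B -> A | bb`, `A -> a | bc | B`, written as `{"S": ["Aa", "B"], "B": ["A", "bb"], "A": ["a", "bc", "B"]}`, the function returns `{"S": ["Aa", "bb", "a", "bc"], "B": ["bb", "a", "bc"], "A": ["a", "bc", "bb"]}`.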