We consider each one-step derivation from the start
variable, then each two-step derivation, and so on in turn.
If we ever see the string we want, we accept.
If all the active derivations involve strings of terminals and variables longer than the string w we are searching for, we halt and reject.
To handle rules like `A -> B`, which can give derivations like `A => B => A`, we maintain a list of strings we have already seen. If a string repeats, we prune that branch.
This is an exponential-time algorithm, whereas if we use a normal form for our grammars we can speed things up to cubic or, in some cases, linear time.
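The search described above can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: a grammar is assumed to be a dict mapping single uppercase letters (variables) to lists of right-hand-side strings, lowercase letters are terminals, and there are no `epsilon`-rules, so sentential forms never shrink. The name `brute_force_parse` is illustrative. The sketch expands only the leftmost variable, which is enough because every string in the language has a leftmost derivation.

```python
def brute_force_parse(rules, start, w):
    """Breadth-first search over leftmost derivations.

    Assumptions (not from the notes): variables are single uppercase
    letters, terminals are lowercase, and no rule has an empty
    right-hand side, so pruning by length is safe.
    """
    seen = {start}          # sentential forms already generated
    frontier = [start]      # forms reachable in exactly k steps
    while frontier:
        next_frontier = []
        for form in frontier:
            # Position of the leftmost variable, if any.
            i = next((k for k, c in enumerate(form) if c.isupper()), None)
            if i is None:
                continue    # all terminals and not equal to w: dead end
            for rhs in rules.get(form[i], []):
                new = form[:i] + rhs + form[i + 1:]
                if new == w:
                    return True
                # Prune forms longer than w and forms seen before.
                if len(new) > len(w) or new in seen:
                    continue
                seen.add(new)
                next_frontier.append(new)
        frontier = next_frontier
    return False            # every active derivation was pruned
```

The `seen` set is what stops a cycle such as `A => B => A` from looping forever, and the length check is what forces termination.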
s-grammars
We would like parsing algorithms which run in linear time.
One way to achieve this is to restrict the kind of grammars we consider: Definition. A context-free grammar `G=(V, T, P, S)` is said to be a simple grammar or s-grammar if all its productions are of the form:
`A -> ax` where `A in V`, `a in T`, `x in V^star`, and any pair `(A,a)` occurs at most once in `P`.
For example, `S ->aS|bSS|c` is an s-grammar, but `S ->aS|bSS|aSS|c` is not because `(S,a)` occurs in `S ->aS` and `S ->aSS`.
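The definition can be checked mechanically. The following is a small sketch, assuming (as an illustration only) that a grammar is a dict from single uppercase variables to lists of right-hand-side strings with lowercase terminals; `is_s_grammar` is a hypothetical helper name.

```python
def is_s_grammar(rules):
    """Check that every production is A -> a x with x in V*, and that
    each pair (A, a) occurs at most once.

    Assumed encoding: uppercase letters are variables, lowercase
    letters are terminals, "" would denote epsilon.
    """
    seen_pairs = set()
    for A, rhss in rules.items():
        for rhs in rhss:
            # Must start with one terminal, followed only by variables.
            if not rhs or not rhs[0].islower():
                return False
            if any(not c.isupper() for c in rhs[1:]):
                return False
            if (A, rhs[0]) in seen_pairs:
                return False
            seen_pairs.add((A, rhs[0]))
    return True
```

On the two grammars above, the check accepts `S -> aS | bSS | c` and rejects the version with the extra rule `S -> aSS`.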
Our brute force parsing algorithm will run in linear time with an s-grammar, since at any given step there is at most one production which can be used: the leftmost variable and the next input character determine it. Further, since the right-hand side of a production always starts with a terminal, we match at least one character of the input with each substitution.
So after at most linearly many substitutions we know if the string is in the language.
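One way to picture this linear-time behaviour is a loop that keeps the not-yet-expanded variables on a stack and consumes one input character per substitution. This is only a sketch: it assumes the input really is an s-grammar, encoded (hypothetically) as a dict from single uppercase variables to lists of right-hand-side strings, and `s_grammar_accepts` is an illustrative name.

```python
def s_grammar_accepts(rules, start, w):
    """Decide membership for an s-grammar in time linear in len(w).

    Assumes `rules` is an s-grammar: every right-hand side is one
    lowercase terminal followed by uppercase variables, and each pair
    (variable, terminal) occurs at most once.
    """
    # table[(A, a)] = x for the unique production A -> a x.
    table = {(A, rhs[0]): rhs[1:] for A, rhss in rules.items() for rhs in rhss}
    stack = [start]                 # pending variables, leftmost on top
    for ch in w:
        if not stack:
            return False            # input left over, nothing to expand
        A = stack.pop()
        x = table.get((A, ch))
        if x is None:
            return False            # no production A -> ch ...
        stack.extend(reversed(x))   # leftmost variable of x ends on top
    return not stack                # accept iff no variables remain
```

Each loop iteration performs exactly one substitution and matches exactly one character, so the number of substitutions is at most `len(w)`.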
s-grammars tend to be too restrictive to specify practical programming languages; nevertheless, they show that the form of the rules matters for obtaining efficient parsers.
We are now going to work towards some normal forms which will be useful in obtaining parsing algorithms for general CFGs.
To do this we will look at different ways to simplify our grammars.
To start, it is sometimes useful to get rid of the empty string from our language in order to make our proofs easier. It turns out this won't cause a loss of generality in the statements we can make about CFGs.
To see this, suppose `L` is a language containing `epsilon` and let `L' = L - {epsilon}`.
If `G'=(V,T,P,S)` is a CFG for `L'`, then `G = (V cup {S_0}, T, P cup {S_0 -> S | epsilon}, S_0)`, where `S_0` is a new variable not in `V`, will be a CFG for `L`.
So we will for now restrict our attention to grammars without `epsilon`.
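The construction that puts `epsilon` back is short enough to sketch directly. The encoding here is an assumption for illustration: grammars are dicts from single uppercase variables to lists of right-hand-side strings, `""` stands for `epsilon`, and the caller supplies a fresh letter to play the role of `S_0`. The name `add_empty_string` is hypothetical.

```python
def add_empty_string(rules, start, new_start):
    """Given a grammar for L', return a grammar and start variable for
    L' together with the empty string, by adding new_start -> start | epsilon.

    `new_start` plays the role of S_0 and must not occur in `rules`.
    """
    used = set(rules) | {c for rhss in rules.values() for rhs in rhss for c in rhs}
    if new_start in used:
        raise ValueError("new start variable must be fresh")
    extended = {new_start: [start, ""]}          # S_0 -> S | epsilon
    extended.update({A: list(rhss) for A, rhss in rules.items()})
    return extended, new_start
```

For instance, starting from `S -> aSb | ab` and choosing `Z` as the fresh variable gives `Z -> S | epsilon` on top of the original rules.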
More Methods of Transforming Grammars
Suppose we have a CFG `G=(V,T,P,S)` and let `A -> x_1Bx_2` be in `P`. Suppose the productions of `G` with `B` on the left-hand side are exactly `B -> y_1 | y_2 | ldots | y_n`. Then if `G'` is the CFG obtained by replacing `A -> x_1Bx_2` by
`A -> x_1y_1x_2 | x_1y_2x_2 | ldots | x_1y_nx_2`, we will have `L(G')=L(G)`.
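As a concrete illustration of this substitution, here is a small sketch. The dict-of-strings encoding (single uppercase letters for variables, lowercase for terminals) and the name `substitute` are assumptions made for the example, and the example grammar is chosen only for illustration.

```python
def substitute(rules, A, rhs, i):
    """Replace the production A -> rhs, whose i-th symbol is a variable B,
    by A -> x1 y x2 for every production B -> y.

    Returns a new rule dict; the input is left unchanged.
    """
    B = rhs[i]
    x1, x2 = rhs[:i], rhs[i + 1:]
    new_rules = {v: list(r) for v, r in rules.items()}
    new_rules[A].remove(rhs)                 # drop A -> x1 B x2
    for y in rules[B]:
        candidate = x1 + y + x2
        if candidate not in new_rules[A]:    # avoid duplicate productions
            new_rules[A].append(candidate)
    return new_rules
```

With `A -> a | aaA | abBc` and `B -> abbA | b`, substituting for `B` in `A -> abBc` yields `A -> a | aaA | ababbAc | abbc`, and the `B` rules stay as they were.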
Another technique for simplifying CFGs is just to get rid of useless rules: Definition. Let `G=(V,T,P, S)` be a CFG. A variable `A` in `V` is said to be useful iff there is at least one `w in L(G)` such that `S =>^star xAy =>^star w`. Otherwise, `A` is said to be useless. A production is useless if it involves any useless variables.
For example, consider the grammar with rules `S ->A`, `A -> aA|epsilon`, `B ->bA`. Then `B` is useless as it is not reachable from the start variable. So the production `B -> bA` is useless.
Given a CFG, if we eliminate all its useless productions, we get a smaller CFG with the same language.
To determine the useful variables and productions, we can start with `V_1 = emptyset`. Then repeat the following until no more variables are added to `V_1`: for each production `A -> x_1 ldots x_n` in which every `x_i` that is a variable is already in `V_1`, add `A` to `V_1`. When this stops, `V_1` contains exactly the variables that derive some string of terminals. If the start variable is not in `V_1`, then we know the language is empty, so we can delete all productions.
Otherwise, if `S` is in `V_1`, it still might not be the case that every variable in `V_1` is useful, since some may not be reachable from `S`, so we set `V_2 = {S}`. Then repeat the following until no more variables are added to `V_2`: for each production `A -> x_1 ldots x_n` with `A in V_2` in which every `x_i` that is a variable is in `V_1`, add each variable on the right-hand side to `V_2`. After this procedure terminates, take `V_2` to be the set of useful variables. All other variables, and the productions they are involved in, are useless.
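The two passes above can be sketched as follows. The encoding is an assumption for illustration (a dict from single uppercase variables to lists of right-hand-side strings, `""` for `epsilon`), and `remove_useless` is a hypothetical name; the sets `v1` and `v2` correspond to `V_1` and `V_2`.

```python
def remove_useless(rules, start):
    """Return the grammar restricted to useful variables and productions."""
    def vars_ok(rhs, allowed):
        # True if every variable (uppercase letter) in rhs is in `allowed`.
        return all(c in allowed for c in rhs if c.isupper())

    # Pass 1: V_1 = variables that derive some string of terminals.
    v1 = set()
    changed = True
    while changed:
        changed = False
        for A, rhss in rules.items():
            if A not in v1 and any(vars_ok(rhs, v1) for rhs in rhss):
                v1.add(A)
                changed = True
    if start not in v1:
        return {}                       # the language is empty

    # Keep only productions all of whose variables are in V_1.
    kept = {A: [rhs for rhs in rules[A] if vars_ok(rhs, v1)] for A in v1}

    # Pass 2: V_2 = variables reachable from the start variable.
    v2 = {start}
    todo = [start]
    while todo:
        A = todo.pop()
        for rhs in kept[A]:
            for c in rhs:
                if c.isupper() and c not in v2:
                    v2.add(c)
                    todo.append(c)
    return {A: kept[A] for A in kept if A in v2}
```

On the example `S -> A`, `A -> aA | epsilon`, `B -> bA`, the first pass keeps all three variables and the second pass drops `B`, since it is never reached from `S`.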
Removing `epsilon`-rules/productions
A production (rule) of a CFG of the form `A -> epsilon` is called an `epsilon`-production or `epsilon`-rule. Any variable `A` for which `A =>^star epsilon` is called nullable.
Even though a CFG might generate a language not containing `epsilon`, it still might have `epsilon`-productions and nullable variables, in which case the `epsilon`-productions can be removed.
For example, in `S -> aCb`, `C -> aCb | epsilon`, the variable `C` is nullable. We can eliminate the `epsilon`-rule by doing substitutions to get: `S -> aCb | ab`, `C -> aCb | ab`.
To find the set `N` of nullable variables of a CFG, we can first put all variables `A` which occur in productions of the form `A -> epsilon` into `N`. Then repeat the following step until no new variables are added: if `B` occurs in a production
`B -> A_1A_2 ldots A_n` where each `A_i` is in `N`, then add `B` to `N`.
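The fixed-point computation of `N` can be sketched as below. As an illustrative assumption, grammars are dicts from single uppercase variables to lists of right-hand-side strings, with `""` standing for `epsilon`; `nullable_variables` is a hypothetical name.

```python
def nullable_variables(rules):
    """Compute the set N of variables A with A =>* epsilon."""
    # Start with variables that have an epsilon-production.
    n = {A for A, rhss in rules.items() if "" in rhss}
    changed = True
    while changed:
        changed = False
        for B, rhss in rules.items():
            if B in n:
                continue
            # B -> A_1 ... A_n with every A_i already in N.
            # A terminal is never in N, so such a rhs has variables only.
            if any(rhs and all(c in n for c in rhs) for rhs in rhss):
                n.add(B)
                changed = True
    return n
```

For `S -> aCb`, `C -> aCb | epsilon`, only `C` ends up in `N`, because every `S` production contains a terminal.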
Once we have the set of nullable variables, we can eliminate any `epsilon`-rules from our grammar, and for each rule `C -> C_1C_2 ldots C_n` in which nullable variables occur we add a rule for each possible substitution of nullable variables by `epsilon`, except that we never add `C -> epsilon` itself.
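A sketch of this elimination step follows. It takes the nullable set as an argument and assumes, for illustration only, the encoding where grammars are dicts from single uppercase variables to lists of right-hand-side strings and `""` stands for `epsilon`. The name `remove_epsilon_rules` is hypothetical. Note that the result generates `L(G) - {epsilon}`.

```python
from itertools import product

def remove_epsilon_rules(rules, nullable):
    """Drop epsilon-rules and add every variant of each rule obtained by
    replacing some occurrences of nullable variables with epsilon."""
    new_rules = {}
    for A, rhss in rules.items():
        out = []
        for rhs in rhss:
            # A nullable symbol may be kept or dropped; others are kept.
            options = [(c, "") if c in nullable else (c,) for c in rhs]
            for choice in product(*options):
                candidate = "".join(choice)
                # Never add A -> epsilon, and avoid duplicates.
                if candidate and candidate not in out:
                    out.append(candidate)
        new_rules[A] = out
    return new_rules
```

Applied to `S -> aCb`, `C -> aCb | epsilon` with nullable set `{C}`, this produces `S -> aCb | ab`, `C -> aCb | ab`, matching the example above. A rule with `k` nullable occurrences can expand into up to `2^k` rules, which is why normal-form conversions usually shorten right-hand sides first.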
Eliminate Unit Productions
A unit production is a production of the form `A -> B`.
In general, using the reachability algorithm we can
determine if `A=>^star C` for any two variables `A` and `C`.
If `C` has the rules `C -> y_1|y_2| ldots |y_n`, then we can add the rules `A -> y_1|y_2| ldots |y_n` to our grammar without affecting the strings it generates. If we do this for every variable `A` that has a unit rule and for each `C` for which `A =>^star C`, we can then delete the unit rules themselves, and so we have eliminated unit rules.
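The procedure can be sketched as follows, assuming `epsilon`-rules have already been removed, so that `A =>^star C` for variables `A` and `C` can only happen through a chain of unit rules. The dict-of-strings encoding (single uppercase variables, lowercase terminals) and the name `remove_unit_rules` are assumptions made for illustration.

```python
def remove_unit_rules(rules):
    """Return an equivalent grammar with no productions of the form A -> B.

    Assumes there are no epsilon-rules, so A =>* C between variables is
    exactly reachability along unit rules.
    """
    def is_unit(rhs):
        return len(rhs) == 1 and rhs.isupper()

    new_rules = {}
    for A in rules:
        # Collect every C with A =>* C (including A itself), in
        # discovery order.
        reach = [A]
        i = 0
        while i < len(reach):
            for rhs in rules.get(reach[i], []):
                if is_unit(rhs) and rhs not in reach:
                    reach.append(rhs)
            i += 1
        # Give A every non-unit right-hand side of every such C.
        out = []
        for C in reach:
            for rhs in rules.get(C, []):
                if not is_unit(rhs) and rhs not in out:
                    out.append(rhs)
        new_rules[A] = out
    return new_rules
```

For example, with `S -> Aa | B`, `B -> A | bb`, `A -> a | bc | B`, the variables `A` and `B` reach each other, so both end up with the rules `a | bc | bb`, and `S` gets `Aa | bb | a | bc`.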