A grammar G is ambiguous if for some w in L(G)
there are two different parse trees for w.
For example, let G be the grammar:
EXP ::= EXP~(+|*)~EXP | NUMBER
While elegant, grammar G gives us too much freedom to
construct a derivation.
For example, let w = 3 + 4 * 5. We begin our derivation with
EXP, but then which operator should we derive first, + or *?
Let's derive + first. Here's the entire derivation:
EXP =>
EXP + EXP =>
EXP + EXP * EXP =>
EXP + EXP * NUMBER =>
EXP + EXP * 5 =>
EXP + NUMBER * 5 =>
EXP + 4 * 5 =>
NUMBER + 4 * 5 =>
2 + 4 * 5
Here's the corresponding parse tree:
Notice that the + node is closer to the root than the *
node.
Now let's derive w again, but this time deriving * first.
Here's the derivation:
EXP =>
EXP * EXP =>
EXP + EXP * EXP =>
NUMBER + EXP * EXP =>
3 + EXP * EXP =>
3 + NUMBER * EXP =>
3 + 4 * EXP =>
3 + 4 * NUMBER =>
3 + 4 * 5
And the corresponding parse tree:
In this case the * node is closer to the root than the +
node.
So what? Does it make a difference? Imagine each of these
trees being passed to an interpreter for execution. The interpreter would use
recursion to evaluate the left and right children of the root, then use the
operator represented by the middle child to combine the roots:
double eval(Tree t) {
double arg1 = eval(t.left);
double arg2 = eval(t.right);
char op = t.middle;
if (op == "+") return arg1 +
arg2;
if (op == "*") return arg1 *
arg2;
// etc.
}
In the first case the multiplication will be performed
first. In other words:
3 + 4 * 5 = 3 + (4 * 5) = 3 + 20 = 23
In the second case the addition will be performed first:
3 + 4 * 5 = (3 + 4) * 5 = 7 * 5 = 35
Different answers! That's bad.
To fix this problem we must try to find an unambiguous
grammar G' such that L(G) = L(G'). We must try to
eliminate choices.
Consider the following grammar G' that forces
multiplications to be performed before additions:
EXP::= PRODUCT | PRODUCT + EXP
PRODUCT::= NUMBER | NUMBER * PRODUCT
Here's a G' derivation of w:
EXP =>
PRODUCT + EXP =>
PRODUCT + PRODUCT
3 + PRODUCT =>
3 + NUMBER * PRODUCT =>
3 + 4 * PRODUCT =>
3 + 4 * NUMBER =>
3 + 4 * 5
(Notice that we have choices within the derivation as to
which variable to replace first: left-most, right-most, etc.) While it is
beneficial to follow an algorithm, all of these choices will yield the same parse
tree.)
Here's the parse tree:
Notice that there were no choices to be made between which
operator should come first: + or *.
It can be proved that L(G) = L(G')
and G' is unambiguous, however finding an equivalent unambiguous grammar for an
arbitrary CFG is famously unsolvable:
Theorem There is
no algorithm that implements the function:
// returns unambiguous G' with L(G) = L(G')
CFG disambiguate(CFG G) { ??? }
This means that disambiguating a CFG is an art rather than a
technique.
To make matters worse, there are CFLs for which there is no
unambiguous grammar. Such languages are called inherently ambiguous.
Theorem: L = {anbncmdm
| 1 <= n && 1 <= m} U {anbmcmdn
| 1 <= n && 1 <= m} is inherently ambiguous.
Proof:
The idea is to show that strings of the form anbncndn must
have two distinct left-most derivations.