Tokens and Scanning

Definitions

The tokens (collectively, the lexicon) of a programming language L are informally defined as follows:

Whitespace = blanks, tabs, and newlines

Token = Any string that doesn't contain whitespace

Tokens(L) = all legal tokens of language L =
   Literals(L) U
   Identifiers(L) U
   Punctuation(L) U
   Delimiters(L) U
   Operators(L)

Literals(L) = Strings U Numbers U Characters U Boolean U etc.

Identifiers(L) = legal names U Keywords(L)

Punctuation(L) = , . ; : etc.

Delimiters(L) = ( ) { } [ ] " " ' ' etc.

Operators(L) = + * - / < > = && || ! etc.

Keywords(L) = if else switch while do for etc.

For example, the Java statement x = (y + 1) * 2; breaks into the identifiers x and y, the literals 1 and 2, the operators = + *, the delimiters ( ), and the punctuation symbol ;. If an illegal token is like a misspelled word, then a lexicon is like a dictionary and a scanner is like a spell checker.

Regular Expressions

Tokens(L) is more formally defined as the set of all tokens that match some pattern, P:

Tokens(L) = L[P] = all tokens that match pattern P

We define patterns (regular expressions) and L[P] by simultaneous recursion:

1. P = t, where t is any token (a literal string), L[P] = {t}

If A and B are patterns, then so are:

2. P = AB,  L[P] = L[A]L[B] = {ab | a in L[A] & b in L[B]}
3. P = A|B, L[P] = L[A] U L[B]
4. P = (A), L[P] = L[A]
5. P = A?,  L[P] = L[A] U {""}
6. P = A+,  L[P] = L[A] U L[A(A+)]
7. P = A*,  L[P] = L[(A+)?]
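
In java.util.regex these constructs are written with the same symbols, so the rules above can be sanity-checked directly. A minimal sketch (the class name is illustrative):

import java.util.regex.Pattern;

public class PatternDemo {
   public static void main(String[] args) {
      System.out.println(Pattern.matches("01", "01"));      // true:  01 is the only string in L[01]
      System.out.println(Pattern.matches("0|1", "1"));      // true:  alternation (rule 3)
      System.out.println(Pattern.matches("(01)?", ""));     // true:  ? admits the empty string (rule 5)
      System.out.println(Pattern.matches("(01)+", "0101")); // true:  one or more repetitions (rule 6)
      System.out.println(Pattern.matches("(01)*", ""));     // true:  * also admits zero repetitions (rule 7)
      System.out.println(Pattern.matches("(01)*", "010"));  // false: 010 is not in L[(01)*]
   }
}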

Non-Terminals

It's sometimes convenient to allow patterns to contain variables representing previously defined patterns. A pattern variable (also called a non-terminal symbol) is enclosed in angle brackets and defined using the "consists of" operator:

<Pattern> ::= some pattern

Although pattern variables can be defined in terms of other pattern variables, we must be careful at this stage to avoid recursive definitions.

Here are some commonly used patterns:

<LC> ::= a | b | ... | z
<UC> ::= A | B | ... | Z
<Letter> ::= <LC> | <UC>
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<Number> ::= (+ | -)?<Digit>+(.<Digit>+)?
<ID> ::= <Letter>(<Letter> | <Digit>)*
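
Because these definitions are non-recursive, each pattern can be substituted into the next as an ordinary string. A minimal sketch of <Number> and <ID> in java.util.regex (the class and constant names are illustrative):

import java.util.regex.Pattern;

public class CommonPatterns {
   public static final String LETTER = "[a-zA-Z]";   // <LC> | <UC>
   public static final String DIGIT  = "[0-9]";
   public static final String NUMBER = "(\\+|-)?" + DIGIT + "+(\\." + DIGIT + "+)?";
   public static final String ID     = LETTER + "(" + LETTER + "|" + DIGIT + ")*";

   public static void main(String[] args) {
      System.out.println(Pattern.matches(NUMBER, "-3.14")); // true
      System.out.println(Pattern.matches(NUMBER, "3."));    // false: the fraction part needs digits
      System.out.println(Pattern.matches(ID, "x27"));       // true
      System.out.println(Pattern.matches(ID, "27x"));       // false: identifiers start with a letter
   }
}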

Examples

L[01] = { 01 }
L[01?] = { 0, 01 }
L[(01)?]  = { "", 01 }
L[01*] = { 0, 01, 011, 0111, etc. }
L[(01)*] = {"", 01, 0101, 010101, etc. }
L[0 | (00)+] = { 0, 00, 0000, 000000, etc. }
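
The trickier cases can be checked the same way; a small illustrative test, showing in particular that * binds only to the symbol or group immediately to its left:

import java.util.regex.Pattern;

public class ExampleCheck {
   public static void main(String[] args) {
      System.out.println(Pattern.matches("01*", "0111"));     // true:  * applies only to 1
      System.out.println(Pattern.matches("(01)*", "0111"));   // false: (01)* repeats the pair 01
      System.out.println(Pattern.matches("0|(00)+", "0000")); // true:  a single 0 or an even number of 0s
      System.out.println(Pattern.matches("0|(00)+", "000"));  // false: 000 is in neither alternative
   }
}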

Tokens(Java)

Identifiers(Java) ::= L[<First><Rest>*]
<First> ::= <Letter> | _ | $
<Rest> ::= <First> | <Digit>
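
The identifier grammar collapses into a single character-class pattern. A minimal sketch (the class name is illustrative; note that a pattern alone cannot exclude keywords, which a real Java scanner must also do):

import java.util.regex.Pattern;

public class JavaIdentifiers {
   // <First> ::= <Letter> | _ | $     <Rest> ::= <First> | <Digit>
   public static final String ID = "[a-zA-Z_$][a-zA-Z0-9_$]*";

   public static void main(String[] args) {
      System.out.println(Pattern.matches(ID, "_count")); // true
      System.out.println(Pattern.matches(ID, "$temp1")); // true
      System.out.println(Pattern.matches(ID, "2fast"));  // false: cannot start with a digit
      System.out.println(Pattern.matches(ID, "while"));  // true, although while is a reserved keyword
   }
}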

Operators(Java) ::= <PREFIX> | <INFIX> | <POSTFIX>
<INFIX> ::= <ARITHMETIC> | <LOGICAL> | <BITWISE> | <ASSIGNMENT>
<ARITHMETIC> ::= + | - | * | / | %
<LOGICAL> ::= < | > | <= | >= | == | != | instanceof | && | ||
<BITWISE> ::= << | >> | >>> | & | | | ^
<ASSIGNMENT> ::= = | += | *= | ...
<PREFIX> ::= ++ | -- | - | ~ | ! | (<TYPE>)
<POSTFIX> ::= ++ | -- | () | [] | .

Tokens(Scheme)

Tokens(C++)

Identifiers(C++) ::= L[<First><Rest>*]
<First> ::= <Letter> | _
<Rest> ::= <First> | <Digit>

Scanners

A scanner is a language processor component that transforms an input program, P, from a sequence of characters into a sequence of tokens. If illegal tokens are encountered, an exception is thrown.

TokenList scan(String program) throws IllegalTokenError {
   // extract tokens from program and put them in a list
}

There are programs such as Lex that can generate scanners from regular expressions.

jutil.Lex.java (Requires JDK 1.4 or later)

package jutil;
import java.util.regex.*;
import java.util.*;

public class Lex {

   public static boolean DEBUG = false;
   public static String FLOAT = "[0-9]*\\.?[0-9]+";
   public static String ID = "[a-zA-Z][a-zA-Z0-9]*";
   // note: multi-character operators must precede their single-character prefixes
   public static String OPERATOR
      ="\\+|\\*|-|/|=|<=|>=|<|>|&|!|%|\\|";
   public static String EXP
      = OPERATOR + "|" + FLOAT + "|" + ID;
   public static String PUNC = ";|\\(|\\)";

   // returns true if the entire phrase matches regExp
   public static boolean matches(String regExp, String phrase) {
      Pattern p = Pattern.compile(regExp);
      Matcher m = p.matcher(phrase);
      boolean b = m.matches();
      if (DEBUG) System.out.println("match = " + b);
      return b;
   }
   // returns true if phrase consists entirely of legal tokens separated by whitespace
   public static boolean checkTokens(String regExp, String phrase) {
      String regExps = "("+regExp+"|\\s)+";
      return matches(regExps, phrase);
   }
   // extracts all substrings of phrase that match regExp, in order of appearance
   public static List scan(String regExp, String phrase) {
      List tokens = new Vector();
      Pattern p = Pattern.compile(regExp);
      Matcher m = p.matcher(phrase);
      while(m.find()) {
         String token = m.group();
         tokens.add(token);
      }
      if (DEBUG) System.out.println("tokens = " + tokens);;
      return tokens;
   }
   // test driver:
   public static void main(String[] args) {
      DEBUG = true;
      checkTokens(EXP, args[0]);
      scan(EXP, args[0]);
      DEBUG = false;
   }
}
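
A short driver shows how the class might be used (LexDemo is illustrative and not part of the jutil package):

package jutil;
import java.util.List;

public class LexDemo {
   public static void main(String[] args) {
      String phrase = "x1 + 2 * y - 3.5";
      System.out.println(Lex.checkTokens(Lex.EXP, phrase)); // true: only legal tokens and whitespace
      List tokens = Lex.scan(Lex.EXP, phrase);
      System.out.println(tokens);                           // [x1, +, 2, *, y, -, 3.5]
   }
}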