1st computer(s)
- Difference engine (Ada Lovelace)
1800's entirely mechanical
programmed using punch cards
like the cards in a player piano or loom
Yup, the 1st programmer was a woman!
- ACE (Alan Turing)
memory state used standing waves in mercury tubes
- Eniac
monstrous vacuum tube
Machine code evolved into assembly then higher levels
abstracting away from any specific machine architecture
Machine independent interpreters
- read a line; decode the instruction; do it
- examples:
FORTH
Postscript
Basic
1st Machine-independent compilers, e.g., FORTRAN
- read the entire program
- convert into machine code
- examples:
FORTRAN
PL/1
Basic
- Big improvement over assembly/machine code
- Eveything was a hack, lex mixed with semantics mixed with structure,
no whitespace, rigid column fields, etc. -- Fortran
- There were many problems with the ad-hoc approach, language nuisances,
inconsistencies in syntax, etc. that cried out for a more formal approach.
(Textbook mentions some of these, incl. the space probe that crashed into
the surface of Venus because of a missing ',' in a FORTRAN DO loop that
didn't perform the necessary delay because it complied as an assignment
stmt "DO 34 I=1. 20000" FORTRAN strips spaces.
- Mathamatical approach to structuring the lexical and syntax sturcture of
a programming language
- Very useful for the front half of compiler
( b.t.w., the front-end is the part closest to reading characters, while
the backend is the part closest to writing the output code )
Practial applications:
- Modern Compilers use separate phases for the front-end:
1. lexers (to convert a character stream into tokens)
2. parsers (to convert a token stream into a concrete syntax tree,
aka parse tree)
3. semantic analysers (to convert a concrete syntax tree into an
abstract syntax tree)
With modern compilers the back 1/2 end of compiler still remains
somewhat ad-hoc, while employing algorithms for code generation or
interpretation specific to the programming language.
- Interactive Development Environments (IDE's), such as Eclipse,
store (in memory) the inputs and output of each phase: character
stream, token stream, etc., so that the front-end phases may operate
incrementally.
+-------------+
string --> | machine | --> "output"
+-------------+
Think of the "string" input as a very long string,
the size of a text file containing a Java program.
Visualize the entire contents of an editor's text buffer
as the input string.
- Recognizers (acceptors)
- reads the string an answers yes/no (membership in a formal language)
- validates the input string
- Translators (transducers)
- converts the input into some transformed output
- Incremental
- the input is modified (as by a text editor) output is
updated/regenerated in real-time
A lexical analyzer (a type of recognizer) produces the tokens
of a programing language.
- Usually these are the "words" separated by whitespace and delimiters
- Tokens often include the delimiters and the (aforementioned) words
- Sometimes whitespasce is given token status, othertimes whitespace often
is simply discarded
- Some words are tokenized as identifiers, other as keywords, others as
operator symbols, for example.
A "parser" groups the (token) words into sentences
- Concrete structure
- Builds an internal tree representation of a program
- Limited (theoretically) to structural only
(more on parsers and syntax later...)
A semantic analyzer picks up where parser left off
- Usually from parse tree or AST
- Type analysis
- du, ud chains
- prepares for optimization and code generation phases in back-end
Two independent efforts discovered equivalent formal techniques for parsing.
John Backus, Peter Naur (Backus-Naur Form 1959)
- worked on programming language issues
- also Niklaus Wirth (popularized Extended BNF)
Noam Chomsky (Contex Free Languagues/Grammars 1959)
- linguistics (founding father in computational linguistics field)
- Chomsky hierarchy
- Type 0 Grammars (right or Left linear grammars)
- Type 1 grammars (CFG's)
- Type 2 grammars (define recusively enumerable sets)
- Type 3 grammars (equivalent in computational power to Turing machines)
- also noted for his journalism, esp. anti-Vietnam War writings
In 1962 Ginsburg and Rice showed equivalence of CFG's and BNF's
Knids of (and example applications of) tokenizers / lecical analyzers
- Recognizer (many final states)
- Once read the input chars are discarded
- Emits token as side effect of entering a final state, afterwards
may enter another, possibly non final state, on subsequent characters
- Usually passes on the emitted token to other compilation phases
- Recognizer + position
- Also remembers char position (line no., col. no.)
- Incremental
- Remeber a state (intermediate or final) at every character position.
- Give example
- Common technology used in IDEs
- Often tokenizer works in concert with (feeds tokens to) parser
Practical considerations: - lex is a program that takes a re as input
and produces (internally) an NFA, then converts the NFA to a DFA then
save into a state table - the state table is output using a
2-dimensional C array - combined with a lex engine C code