Section Notes, week 2: Lexical Analyzers

RE -> NFA+actions

Goal: generate a lexical analyzer automatically from a specification. The specification is a list of REs (regular expressions), with an action for each RE. Generally the action is to emit one token. Lexical analyzer generation is done in a sequence of steps:

Spec (REs+actions) -> (NFAs+actions) -> Single NFA+actions -> DFA+actions -> DFA Tables

We're going to concentrate on the first two steps.

REs+actions -> NFAs+actions

All we need to do is convert each RE into an NFA. We learned the basic rules in lecture.

In a program (such as PA2), the REs are represented as ASTs. We convert each RE AST into an NFA by traversal.

Example

Let's convert an RE to an NFA: (a|b)*

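The AST (the "true" form, by the way) for the RE is a star node whose only child is an OR node with the leaves a and b:

    star
     |
     or
    /  \
   a    b

We traverse it bottom-up (children before parents), building the NFA for each node from the NFAs of its children.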
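To make the traversal concrete, here is a minimal sketch of the standard (Thompson-style) construction over a toy AST. The Re and Thompson classes, the integer state numbering, and the {from, label, to} edge encoding are all made up for illustration; they are not the PA2 classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical RE AST node; PA2's real classes will look different.
final class Re {
    final char op;        // 'c' = single symbol, '|' = or, '.' = concatenation, '*' = star
    final char sym;       // only meaningful when op == 'c'
    final Re left, right; // left is the body for '*'; both are null for 'c'

    private Re(char op, char sym, Re left, Re right) {
        this.op = op; this.sym = sym; this.left = left; this.right = right;
    }
    static Re sym(char c)     { return new Re('c', c, null, null); }
    static Re or(Re l, Re r)  { return new Re('|', '\0', l, r); }
    static Re cat(Re l, Re r) { return new Re('.', '\0', l, r); }
    static Re star(Re body)   { return new Re('*', '\0', body, null); }
}

class Thompson {
    static final int EPS = -1;                   // edge label standing for epsilon
    final List<int[]> edges = new ArrayList<>(); // each edge is {from, label, to}
    private int nextState = 0;

    private int fresh() { return nextState++; }
    private void edge(int from, int label, int to) { edges.add(new int[] {from, label, to}); }

    // Builds the children first, then glues their fragments together.
    // Returns {start, accept} of the NFA fragment for re.
    int[] build(Re re) {
        switch (re.op) {
            case 'c': {
                int a = fresh(), b = fresh();
                edge(a, re.sym, b);
                return new int[] {a, b};
            }
            case '.': {
                int[] l = build(re.left), r = build(re.right);
                edge(l[1], EPS, r[0]);
                return new int[] {l[0], r[1]};
            }
            case '|': {
                int[] l = build(re.left), r = build(re.right);
                int a = fresh(), b = fresh();
                edge(a, EPS, l[0]); edge(a, EPS, r[0]);
                edge(l[1], EPS, b); edge(r[1], EPS, b);
                return new int[] {a, b};
            }
            case '*': {
                int[] f = build(re.left);
                int a = fresh(), b = fresh();
                edge(a, EPS, f[0]); edge(a, EPS, b);       // enter the body, or skip it
                edge(f[1], EPS, f[0]); edge(f[1], EPS, b); // repeat the body, or leave
                return new int[] {a, b};
            }
            default:
                throw new IllegalArgumentException("unknown RE node: " + re.op);
        }
    }
}
```

For (a|b)*, calling new Thompson().build(Re.star(Re.or(Re.sym('a'), Re.sym('b')))) numbers the states in the order the nodes are finished: 0 and 1 for a, 2 and 3 for b, 4 and 5 for the OR, 6 and 7 for the star.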

Another idea: Instead of building up the NFAs, we could build up Java code that constructs the NFAs. We could include this code in the generated lexical analyzer, and we wouldn't need to ship the REs with the lexical analyzer at all.

Main points:

We can translate RE to NFA by traversal/visiting.
We can translate to many things, e.g., Java for generating NFAs.
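The "translate to many things" point can be sketched with the same kind of traversal producing Java source text instead of an NFA. Everything named here is an assumption for illustration: ReNode is a toy AST class, and the Nfa.symbol, Nfa.or, Nfa.concat and Nfa.star calls in the emitted text stand for whatever NFA-building library the generated lexical analyzer would ship with. Escaping of special characters such as the single quote is left out.

```java
// Hypothetical RE AST node, used only for this sketch.
final class ReNode {
    final char op;            // 'c' = single symbol, '|' = or, '.' = concatenation, '*' = star
    final char sym;           // only meaningful when op == 'c'
    final ReNode left, right; // left is the body for '*'

    ReNode(char op, char sym, ReNode left, ReNode right) {
        this.op = op; this.sym = sym; this.left = left; this.right = right;
    }
}

class CodeEmitter {
    // Same traversal shape as building an NFA, but the result is a Java
    // expression (as text) that would build the NFA when it is later run.
    static String emit(ReNode re) {
        switch (re.op) {
            case 'c': return "Nfa.symbol('" + re.sym + "')";
            case '|': return "Nfa.or(" + emit(re.left) + ", " + emit(re.right) + ")";
            case '.': return "Nfa.concat(" + emit(re.left) + ", " + emit(re.right) + ")";
            case '*': return "Nfa.star(" + emit(re.left) + ")";
            default:  throw new IllegalArgumentException("unknown RE node: " + re.op);
        }
    }
}
```

For the AST of (a|b)*, emit returns the text Nfa.star(Nfa.or(Nfa.symbol('a'), Nfa.symbol('b'))), which could be pasted into the generated lexical analyzer.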

Eclipse ASTs

Where do the ASTs come from? A parser.

Would you like to write a parser as part of PA2? How about using the Java parser instead?

Isn't that crazy? Not as long as we use Java syntax. Semantics (meaning) is different, of course. So you see, syntax alone has no meaning.

We need to pick an Eclipse Java AST node class for each RE node type.

*	PrefixExpression --
+	PrefixExpression ++
|	InfixExpression |
.	InfixExpression & (why not InfixExpression . ?)
'a'	IntegerLiteral Character.intValue('a')

Main points:

Syntax is structure without meaning.
We can assign any meaning we want, e.g., consider Java expressions as REs (and note that these expressions have no meaning in Java).

NFAs+actions -> Single NFA+actions

Now we want to combine all NFAs into a single NFA.

To combine the NFAs, it's just the OR construction. But what about the actions? Where do they go?

First try: put each action with the final state in the NFA it came with. When we reach the final state, do the action.

Problem: what if we have two REs, "for" and "foreach"? We want to match the longest lexeme. But once we have read 'f', 'o', 'r', we will do the "for" action right away, even if the input continues with "each". Bad!

Second try: when we reach the final state, remember state and position in input. Keep running NFA. When we get stuck, return to remembered state and position. Do the action, and start over.

Problem: what if we have two REs, "if" and "{alphanum}*"? On input 'i', 'f', space, both match the same two characters, so the longest-match rule alone cannot choose between them.

Third try: assign a priority to each RE. When several REs match the same longest lexeme, take the one with the highest priority. In practice, the RE that comes first in the spec file usually gets the higher priority.

Main points:

Lexical analyzer is based on an NFA,...
but extended with actions and priority
Lexical analyzer needs remembered lexemes/backtracking

Now it works. We don't have a traditional recognizer NFA anymore. Basic NFAs don't have actions or priority. But it works as a lexical analyzer.
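Here is a minimal sketch of the third try, under several assumptions. The combined NFA is written out by hand for three made-up rules (IF, ID as one or more lowercase letters, and WS for spaces), joined by the OR construction at a new start state 0. Accepting states carry the index of their rule, and a lower index means higher priority. The class name, the edge encoding and the token strings are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TinyLexer {
    static final String[] RULE_NAMES = {"IF", "ID", "WS"}; // spec order = priority order
    // Each edge is {from, lo, hi, to}: a move on any char in [lo, hi]; lo == -1 marks epsilon.
    static final int[][] EDGES = {
        {0, -1, -1, 1}, {0, -1, -1, 4}, {0, -1, -1, 6}, // OR construction: new start state 0
        {1, 'i', 'i', 2}, {2, 'f', 'f', 3},             // rule 0: "if"
        {4, 'a', 'z', 5}, {5, 'a', 'z', 5},             // rule 1: one or more lowercase letters
        {6, ' ', ' ', 7}, {7, ' ', ' ', 7},             // rule 2: one or more spaces
    };
    // ACCEPT_RULE[state] is the rule accepted in that state, or -1 if not accepting.
    static final int[] ACCEPT_RULE = {-1, -1, -1, 0, -1, 1, -1, 2};

    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : EDGES)
                if (e[0] == s && e[1] == -1 && result.add(e[3])) work.push(e[3]);
        }
        return result;
    }

    static Set<Integer> step(Set<Integer> states, char c) {
        Set<Integer> next = new TreeSet<>();
        for (int[] e : EDGES)
            if (states.contains(e[0]) && e[1] != -1 && e[1] <= c && c <= e[2]) next.add(e[3]);
        return closure(next);
    }

    // Highest-priority (lowest-index) rule accepted by any state in the set, or -1.
    static int bestRule(Set<Integer> states) {
        int best = -1;
        for (int s : states) {
            int r = ACCEPT_RULE[s];
            if (r != -1 && (best == -1 || r < best)) best = r;
        }
        return best;
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Set<Integer> states = closure(new TreeSet<>(Collections.singleton(0)));
            int lastRule = -1, lastEnd = pos; // remembered rule and input position
            for (int i = pos; i < input.length() && !states.isEmpty(); i++) {
                states = step(states, input.charAt(i)); // keep running the NFA
                int r = bestRule(states);
                if (r != -1) { lastRule = r; lastEnd = i + 1; }
            }
            if (lastRule == -1) throw new IllegalArgumentException("no rule matches at position " + pos);
            // Stuck (or out of input): back up to the remembered position and do the action.
            if (lastRule != 2) tokens.add(RULE_NAMES[lastRule] + "(" + input.substring(pos, lastEnd) + ")");
            pos = lastEnd;
        }
        return tokens;
    }
}
```

On the input "if iffy if" this yields IF(if), ID(iffy), IF(if). Priority decides the two-character lexeme "if", where both IF and ID accept, and longest match turns "iffy" into a single ID rather than IF followed by something else.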

[Example]

Writing Lexer Specifications

Language PL/0 has these kinds of lexemes: =, ==, +, -, identifiers, if, while, quoted strings. Any problems with this specification?


// Declarations

ALPHA = ['_', 'A'-'Z', 'a'-'z']

ALPHANUM = ['0'-'9', '_', 'A'-'Z', 'a'-'z']



// Rules

{ALPHA}{ALPHANUM}*     { return new Token(ID); }

'i' 'f'                { return new Token(IF); }

'w' 'h' 'i' 'l' 'e'    { return new Token(WHILE); }

'='                    { return new Token(EQUALS); }

'-'                    { return new Token(PLUS); }

'+'                    { return new Token(MINUS); }

'"' .* '"'             { return new Token(STRING); }

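' '+                   { return nextToken(); }

"if" and "while" are analyzed as ID, because the ID rule comes first and matches the same lexeme. Need to move the keyword rules above the ID rule.
Need the lexeme value for ID and STRING tokens.
Need a separate rule for "==". Otherwise it is analyzed as two EQUALS tokens, and it matters whether there is a space between the two '=' characters.
The actions for '-' and '+' are swapped: '-' returns PLUS and '+' returns MINUS.
The string rule is wrong because of the greedy match: .* runs on to the last quote it can reach. Need to exclude the quote from the string body, maybe newline too.
Keywords inside strings are handled correctly, because all the characters inside the quotes are part of the STRING lexeme and are never seen by the other rules.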
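The greedy-match problem with the string rule can be tried out with java.util.regex, whose .* is also greedy. This is only an analogy, assuming the lexer's longest-match behaves like a greedy regex on this input. The StringRuleDemo class and the longestPrefix helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringRuleDemo {
    // Returns the prefix of input matched by regex, or null if there is no match at the start.
    static String longestPrefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "\"a\" + \"b\"";                              // the text: "a" + "b"
        System.out.println(longestPrefix("\".*\"", input));          // rule as written in the spec
        System.out.println(longestPrefix("\"[^\"\\n]*\"", input));   // quote and newline excluded
    }
}
```

With the rule as written, the whole input "a" + "b" comes back as one string. With the quote excluded from the body, only "a" is matched and the rest is left for the other rules.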