Section notes, week 3: PA2

Revision Control Systems

Revisions vs. Versions

Every time we edit and save a document, such as a source code file, we create a revision. This happens automatically, whether we think about revisions or not.

A version is created consciously by selecting a revision and assigning it a version identifier (e.g., Mozilla 1.4). Sometimes a version is published and becomes a release. Other versions are purely internal.

Revision Chaos

Say we are maintaining documents on a Unix file system with one file per document.

When we create a new revision, what happens to the old one? It's gone. What if the new revision introduced errors? We'd want to look at the old revision to see how the error was introduced. What if we removed a feature that we need to add back?

So we need (1) a way of recording all revisions.

We could try backing up a file before we edit it. We could back up every revision, which would be a lot of work, or back up only important revisions, which requires us to know which revisions are important ahead of time.

Now that we have all these revisions, how do we know which is which? When were they created, which one came from which, what are the differences, etc.? For example, what if all the work for the past week is fatally flawed and we want to back up to the last revision before the week started? What if we want to know the exact revision where a certain bug was introduced?

We need (2) a way of naming revisions and (3) tracking their history.

Large projects have more than one developer. We're going to have problems if two developers try to change a file at the same time. We need (4) a way of managing concurrent development.

Often, revisions form a chain: a file is created as revision 0, edited to form revision 1, which is edited to form revision 2. Other times, we need something more. For example, your company has released Foobar 1.7 from revision 45 of Foobar.java, and is now adding features for the new Foobar 1.8. So far, the developers have created revisions 46, 47, and 48. Then a user reports a bug. You need to be able to fix the bug in revision 45 so you can release a patch, but you're still working on revision 48. So you put the bug fix in revision 45.1, branching the chain. Now you can fix bugs on one branch, and add features on the other. Later, we'll want to merge the bug fixes into the new version, so the branches aren't totally independent. We need (5) a way of branching and merging chains of revisions.

Revision Control

Because of the needs mentioned above, developers do not keep source code files or other important, changing documents as regular system files with arbitrary names. Instead, they use a revision control system with the five properties mentioned above. Note that a revision control system in this sense is a system in the real world that incorporates developers, users, managers, written policies, computers, etc. However, it turns out to be very difficult to maintain a good revision control system without a lot of software support, so in practice most of the work is done by software, which is then itself called a revision control system. But the software still won't do a good job of managing revisions without developers who use it properly.

Software for revision control, such as CVS, helps with all 5 requirements:

Record all revisions. CVS maintains a database of all documents (such as source code). When a developer wants to edit a document, she checks it out to a local file, edits it, and then commits it to the database. The new revision is recorded in the database, and the record of the previous revision is preserved. At any time, developers can extract prior revisions, or even roll back the copy of a file in the database to a previous revision.
Name revisions. CVS names revisions as 1.1, 1.2, 1.3, and so on.
Track revision history. CVS also records metadata such as the name of the previous revision, the time the revision was committed, the name of the developer who committed it, and a text comment from the developer. CVS can show you a log of this data. It can also show you the differences between revisions using 'diff'.
Manage concurrent development. This is a hard one, and definitely requires communication between developers as well as software support. There are several ways to do it. For example, Visual SourceSafe, another revision control software product, allows a developer to lock a file so that no other developer may edit it. The problem is that sometimes developers lock a file and then go on vacation. Or two developers want to edit separate parts of the file, but can't because only one can hold the lock. Because of this, CVS does not rely on locks by default. Instead, each developer checks out documents and edits them normally. The first developer to commit notices nothing special. When the second developer tries to commit changes, he gets a message saying that the file has been changed. Next, he updates his local copy, which merges in the changes committed to the database since he checked the file out. Then he checks that those changes are consistent with what he did. Finally, he has a file that contains both his changes and the other developer's, so he commits it.
Branch and merge revision changes. CVS can do this too, essentially in the way described above. You probably won't need this feature for CS 164.
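
The commit-then-update cycle described under "Manage concurrent development" can be modeled in a few lines of Java. This is only a toy model and not how CVS is implemented: ToyRepository and its methods are hypothetical names, revisions are numbered from 0 rather than 1.1, and the merge itself is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of lock-free commits. Hypothetical names; not CVS's implementation.
class ToyRepository {
    private final List<String> revisions = new ArrayList<>();

    ToyRepository(String initialContent) {
        revisions.add(initialContent);
    }

    // Number of the newest revision (0 for the initial one).
    int head() {
        return revisions.size() - 1;
    }

    // Every revision stays in the list, so old ones can always be extracted.
    String checkout(int revision) {
        return revisions.get(revision);
    }

    // Accept the commit only if the developer's copy was based on the newest
    // revision. Otherwise the developer must update (merge) and try again.
    boolean commit(int baseRevision, String content) {
        if (baseRevision != head()) {
            return false; // corresponds to "the file has been changed"
        }
        revisions.add(content);
        return true;
    }
}
```

If two developers both check out revision 0, the first commit(0, ...) succeeds and the second returns false. The second developer then merges by hand against checkout(head()) and commits again with head() as the base.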

CVS and Eclipse

CVS is a popular open-source revision control system. See the CVS homepage for downloads and documentation.

Eclipse has built-in CVS support. CVS is accessed through the Team menu, which is part of the context menu for Package Explorer items. There are tutorials available on the web.

PA2

Part of PA2 is to create a tool for automatically generating lexical analyzers from specifications based on regular expressions. Specifically, you will write code to convert a set of regular expressions with actions to a single NFA with actions.

Representing NFAs

Recall that a classic NFA has these parts:

An alphabet. In our case, this is just the data type char.
A set of states.
A start state.
A set of final states.
A transition function.

The core of the Java representation of an NFA is the NFAState class. Each instance represents one state in an NFA. NFAState has a boolean field to indicate if the state is a final state. The class also has a transition table, giving the next states for each character in the alphabet (in an NFA, one character can lead to more than one state). The only part of the NFA left is the start state, which is recorded as a separate variable.

But we need a little more information for lexing. (Such as?) We need the action for each final state. Also, to apply the NFA construction rules, it helps to record the start and end state for each NFA.
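
To make the representation concrete, here is a minimal sketch of what such classes could look like. It is an assumption for illustration, not the PA2 starter code: the field and method names (isFinal, action, next, epsilon, addTransition, addEpsilon) and the NFA wrapper class are hypothetical, and the real NFAState may be organized differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the PA2 starter kit's NFAState may differ.
class NFAState {
    boolean isFinal;   // true if this is a final state
    String action;     // lexer action, meaningful only for final states

    // Transition table: character -> list of possible next states.
    final Map<Character, List<NFAState>> next = new HashMap<>();
    // Epsilon transitions, taken without consuming input.
    final List<NFAState> epsilon = new ArrayList<>();

    void addTransition(char c, NFAState target) {
        next.computeIfAbsent(c, k -> new ArrayList<>()).add(target);
    }

    void addEpsilon(NFAState target) {
        epsilon.add(target);
    }
}

// An NFA, recording the start and end state used by the construction rules.
class NFA {
    final NFAState start;
    final NFAState end;

    NFA(NFAState start, NFAState end) {
        this.start = start;
        this.end = end;
    }
}
```

The set of states is implicit here: it is whatever is reachable from the start state.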

RE->NFA

This is an interpretation process, just like PA1, except that the values are NFAs instead of integers.
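
The sketch below shows one way to read "interpretation with NFAs as values". It assumes a hypothetical AST with four node types (Chr, Cat, Alt, Star) and uses Thompson-style construction rules; the AST classes and the rules used in the course may differ. Each node's eval() returns an NFA fragment with one start and one end state, the way an arithmetic node would return an integer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of RE->NFA as an AST interpreter (Thompson-style rules).
class ReToNfa {
    static class State {
        final List<State> epsilon = new ArrayList<>();
        Character label;   // character on the single labeled edge, or null
        State target;      // target of the labeled edge, or null
    }

    // The "value" of a regular expression: an NFA with a start and an end state.
    static class Frag {
        final State start, end;
        Frag(State start, State end) { this.start = start; this.end = end; }
    }

    interface Re { Frag eval(); }

    static class Chr implements Re {
        final char c;
        Chr(char c) { this.c = c; }
        public Frag eval() {
            State s = new State(), e = new State();
            s.label = c;
            s.target = e;
            return new Frag(s, e);
        }
    }

    static class Cat implements Re {
        final Re left, right;
        Cat(Re left, Re right) { this.left = left; this.right = right; }
        public Frag eval() {
            Frag a = left.eval(), b = right.eval();
            a.end.epsilon.add(b.start);   // glue the end of a to the start of b
            return new Frag(a.start, b.end);
        }
    }

    static class Alt implements Re {
        final Re left, right;
        Alt(Re left, Re right) { this.left = left; this.right = right; }
        public Frag eval() {
            Frag a = left.eval(), b = right.eval();
            State s = new State(), e = new State();
            s.epsilon.add(a.start);       // choose either branch
            s.epsilon.add(b.start);
            a.end.epsilon.add(e);
            b.end.epsilon.add(e);
            return new Frag(s, e);
        }
    }

    static class Star implements Re {
        final Re body;
        Star(Re body) { this.body = body; }
        public Frag eval() {
            Frag a = body.eval();
            State s = new State(), e = new State();
            s.epsilon.add(a.start);       // enter the loop
            s.epsilon.add(e);             // or skip it entirely
            a.end.epsilon.add(a.start);   // repeat the body
            a.end.epsilon.add(e);         // or leave the loop
            return new Frag(s, e);
        }
    }
}
```

For example, new ReToNfa.Cat(new ReToNfa.Chr('a'), new ReToNfa.Chr('b')).eval() builds a four-state fragment for the expression ab.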

If we do that, we end up with "Lexer Tool Alpha", shown below.

Lexer Tool Alpha includes the "Lexer Engine", which takes an NFA and an input program, and produces the token stream from the program. We'll go over how this works a little later. For now, just assume we've already written it.

All together, Lexer Tool Alpha takes in a spec and an input program, and puts out a token stream. This works fine, but there is an opportunity for improvement. (Which is?) We have to run the RE->AST and AST->NFA processes every time we use the lexer, even though they're always the same.

How do we fix this? Split the tool into a code generator, "Lexer Generator Beta", that runs at lexer-tool build time, and an engine that runs at run-time:

The lexer generator generates the NFA for lexing. Then, we package it together with the lexer engine, and we have a complete lexer that doesn't need to reparse the specification file. The generated lexer is a Java program that takes source text as input and returns a token stream, which is exactly what we want. In a way, Lexer Tool Alpha is an interpreter for lexer specifications, and Lexer Generator Beta is a compiler.

The only question left is how to implement the "NFA for Lexing" box in this diagram. We represent NFAs with Java objects, but the generated lexer consists of code. (So?) A solution is to generate Java code that calls constructors to create the objects. There is still a little to do at runtime, but at least we don't have to do parsing and RE->NFA conversion. We can actually generate the code directly using the traversal, just as before. It's a bit like the pretty-printer, in that we are using a traversal to translate an AST to another form: in this case, Java code that creates the corresponding NFA.
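
As a rough illustration of generating code instead of objects, the sketch below emits Java source text rather than building states directly. Everything in it is an assumption: NfaCodeGen is a hypothetical name, only the single-character and concatenation cases are shown, and the emitted code presumes an NFAState class with addTransition and addEpsilon methods, which may not match the real one. An AST traversal would call genChar and genConcat once per node.

```java
// Hypothetical sketch: emit Java source that rebuilds an NFA at lexer start-up.
class NfaCodeGen {
    private final StringBuilder out = new StringBuilder();
    private int counter = 0;

    // Emit a constructor call for a fresh state and return its variable name.
    private String newState() {
        String name = "s" + counter++;
        out.append("NFAState ").append(name).append(" = new NFAState();\n");
        return name;
    }

    // Emit code for a one-character NFA; returns {startVar, endVar}.
    // The character is written as a numeric cast so no escaping is needed.
    String[] genChar(char c) {
        String s = newState();
        String e = newState();
        out.append(s).append(".addTransition((char) ").append((int) c)
           .append(", ").append(e).append(");\n");
        return new String[] { s, e };
    }

    // Emit code joining two fragments that were already generated.
    String[] genConcat(String[] a, String[] b) {
        out.append(a[1]).append(".addEpsilon(").append(b[0]).append(");\n");
        return new String[] { a[0], b[1] };
    }

    // The generated source text so far.
    String source() {
        return out.toString();
    }
}
```

The generator returns variable names where the interpreter would return state objects; the shape of the traversal stays the same.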

The Lexer Engine

The lexer engine reads in an NFA, and then runs it on input source text to produce a token stream. This uses the normal NFA simulation algorithm, which you are going to implement in the homework. Here are a few of the high points.

At each step, the NFA is in a set of states, so we need a set data structure. We know the universe set ahead of time (all states), so we can use an array of booleans. To run one step, we keep one array holding the current state set and a second, initially empty (all false) array for the next state set. For each state in the current set, we find all the states that come next on the current input character and set them in the next state set. After that, we follow the epsilon transitions from each state in the next state set, adding every epsilon-reachable state to the set and repeating until no new states are added.
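
The step described above can be sketched as follows. This is a sketch under assumptions, not the homework solution: states are identified by the integers 0..n-1 so that a state set is a boolean[], and NfaStepper with its methods is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: states are numbered 0..n-1, so a state set is a boolean[].
class NfaStepper {
    final int n;
    final List<Map<Character, List<Integer>>> next = new ArrayList<>();
    final List<List<Integer>> eps = new ArrayList<>();

    NfaStepper(int n) {
        this.n = n;
        for (int i = 0; i < n; i++) {
            next.add(new HashMap<>());
            eps.add(new ArrayList<>());
        }
    }

    void addTransition(int from, char c, int to) {
        next.get(from).computeIfAbsent(c, k -> new ArrayList<>()).add(to);
    }

    void addEpsilon(int from, int to) {
        eps.get(from).add(to);
    }

    // Add every epsilon-reachable state; the worklist follows chains of
    // epsilon edges until nothing new turns up.
    void closeUnderEpsilon(boolean[] set) {
        Deque<Integer> work = new ArrayDeque<>();
        for (int s = 0; s < n; s++) {
            if (set[s]) work.push(s);
        }
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : eps.get(s)) {
                if (!set[t]) {
                    set[t] = true;
                    work.push(t);
                }
            }
        }
    }

    // One step: consume character c starting from the current state set.
    boolean[] step(boolean[] current, char c) {
        boolean[] result = new boolean[n];   // initially empty (all false)
        for (int s = 0; s < n; s++) {
            if (!current[s]) continue;
            List<Integer> targets = next.get(s).get(c);
            if (targets == null) continue;   // no transition on c from s
            for (int t : targets) result[t] = true;
        }
        closeUnderEpsilon(result);
        return result;
    }

    // True when no state is set, i.e. the NFA is stuck.
    static boolean isEmpty(boolean[] set) {
        for (boolean b : set) {
            if (b) return false;
        }
        return true;
    }
}
```

The initial state set should also be closed under epsilon transitions before the first call to step.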

The basic idea of lookahead is that when we reach a final state, we don't do the action. Instead we remember the current input position and the action and keep running. Once we get stuck in the NFA, only then do we run the last action that matched, which would be the maximal munch lexeme. We also back up in the input to the remembered position. We know we are stuck when the current state set is the empty set. It is easy to detect this condition using the representation discussed above.
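
Here is a small sketch of that remember-and-back-up loop. To stay self-contained it hides the automaton behind a hypothetical Matcher interface and drives it with a toy LiteralMatcher that recognizes fixed strings; in the real engine the NFA state set would play that role. All names here are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of maximal munch; Matcher stands in for the NFA simulation.
class MaximalMunch {
    interface Matcher {
        void reset();           // go back to the start state set
        boolean step(char c);   // false means the state set would be empty (stuck)
        String finalAction();   // action of a final state we are in, or null
    }

    static class Token {
        final String action, lexeme;
        final int end;          // input position just past the lexeme
        Token(String action, String lexeme, int end) {
            this.action = action;
            this.lexeme = lexeme;
            this.end = end;
        }
    }

    static Token nextToken(Matcher m, String input, int start) {
        m.reset();
        String lastAction = null;   // last action that matched
        int lastEnd = start;        // input position remembered with it
        int pos = start;
        while (pos < input.length() && m.step(input.charAt(pos))) {
            pos++;
            String action = m.finalAction();
            if (action != null) {   // final state: remember it, keep running
                lastAction = action;
                lastEnd = pos;
            }
        }
        if (lastAction == null) return null;   // no token matches here
        // Back up: the token ends at lastEnd, even if pos ran further ahead.
        return new Token(lastAction, input.substring(start, lastEnd), lastEnd);
    }

    // Toy matcher over literal strings, only to make the sketch runnable.
    static class LiteralMatcher implements Matcher {
        private final Map<String, String> actions = new LinkedHashMap<>();
        private String prefix = "";

        LiteralMatcher add(String literal, String action) {
            actions.put(literal, action);
            return this;
        }
        public void reset() { prefix = ""; }
        public boolean step(char c) {
            String longer = prefix + c;
            for (String literal : actions.keySet()) {
                if (literal.startsWith(longer)) {
                    prefix = longer;
                    return true;
                }
            }
            return false;
        }
        public String finalAction() { return actions.get(prefix); }
    }
}
```

With the literals "a" and "abc" and the input "abx", the loop reads 'a' (final), reads 'b' (not final), gets stuck on 'x', and returns the token for "a" with end position 1, so lexing resumes at 'b'.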

One detail of lookahead is that we need to be able to back up in the input. This requires some kind of buffering. In Java, we can use a PushbackReader, as in the PA2 starter kit.
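
A minimal sketch of backing up with java.io.PushbackReader follows. PushbackDemo and take are hypothetical names and this is not the starter kit's code. The pushback buffer size passed to the PushbackReader constructor must be at least as large as the longest span you will ever unread, otherwise unread throws an IOException.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: read ahead, then push back what the token did not use.
class PushbackDemo {
    // Reads up to `lookahead` characters, keeps the first `keep` of them as the
    // lexeme, and returns the rest to the reader so they will be read again.
    static String take(PushbackReader in, int lookahead, int keep) {
        try {
            char[] buf = new char[lookahead];
            int got = Math.max(in.read(buf, 0, lookahead), 0);   // -1 at end of input
            int kept = Math.min(keep, got);
            in.unread(buf, kept, got - kept);                    // back up in the input
            return new String(buf, 0, kept);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, with new PushbackReader(new StringReader("abx"), 8), calling take(in, 3, 1) returns "a" and leaves "bx" to be read by the next call.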