- One thing that I was wondering about was how the algorithms here scale
with the size of the CF grammar. I guess the running time figures
assume a fixed grammar. This might be fine for most problems
encountered in practice, but from my understanding of the
balanced-parentheses grammar, the number of non-terminals (in the
normalized form with rule right-hand sides of length at most 2) seems to
grow in proportion to the number of call sites, so scaling information
could be important.
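As a toy illustration of that growth, one can write the normalized grammar down explicitly. This is my own sketch, not code from the paper; all names (M, A_i, B_i, open_i, close_i) are illustrative:

```python
# The matched-parentheses grammar has one production per call site i,
#     M -> (_i M )_i M     (plus M -> epsilon),
# and normalizing so every right-hand side has at most two symbols
# introduces a constant number of fresh nonterminals per call site,
# so the grammar grows linearly with the number of call sites.

def normalized_rules(num_call_sites):
    """Binary/epsilon rules for the matched-call grammar, with every
    right-hand side of length at most 2."""
    rules = [("M", ())]  # M -> epsilon
    for i in range(num_call_sites):
        # M -> (_i M )_i M  becomes three rules of RHS length 2:
        rules.append(("M", (f"A{i}", f"B{i}")))      # M   -> A_i B_i
        rules.append((f"A{i}", (f"open{i}", "M")))   # A_i -> (_i M
        rules.append((f"B{i}", (f"close{i}", "M")))  # B_i -> )_i M
    return rules

rules = normalized_rules(5)
nonterminals = {lhs for lhs, _ in rules}
# 1 + 2k nonterminals for k call sites: linear growth, not constant.
assert len(nonterminals) == 1 + 2 * 5
```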
- The paper does not mention that the "standard" algorithm for
CFL-reachability is not scalable, as it requires O(N^2) space. While
CFL-reachability is a nice theoretical framework, I'm not sure there has
been much work on efficient algorithms for solving the general
CFL-reachability problem. Also, the presentation of the exploded
supergraph sort of obscures how their efficient algorithm for
interprocedural dataflow analysis actually works, and makes it seem more
complicated than it actually is.
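For reference, the "standard" algorithm I have in mind is roughly the following worklist closure over labeled edges; materializing every derived edge is exactly where the O(N^2)-per-symbol space cost comes from. This is my own sketch, not code from the paper:

```python
from collections import defaultdict

def cfl_reach(edges, binary_rules, unit_rules):
    """edges: set of (u, v, label) triples; binary_rules: (lhs, (b, c));
    unit_rules: (lhs, rhs). Returns the closure of labeled edges."""
    work = list(edges)
    known = set(edges)
    out_by = defaultdict(set)  # (node, label) -> successor nodes
    in_by = defaultdict(set)   # (node, label) -> predecessor nodes

    def add(edge):
        if edge not in known:
            known.add(edge)
            work.append(edge)

    while work:
        u, v, x = work.pop()
        out_by[(u, x)].add(v)
        in_by[(v, x)].add(u)
        for lhs, rhs in unit_rules:       # rule A -> X
            if rhs == x:
                add((u, v, lhs))
        for lhs, (b, c) in binary_rules:  # rule A -> B C
            if x == b:  # (u,v,B) followed by (v,w,C) yields (u,w,A)
                for w in list(out_by[(v, c)]):
                    add((u, w, lhs))
            if x == c:  # (w,u,B) followed by (u,v,C) yields (w,v,A)
                for w in list(in_by[(u, b)]):
                    add((w, v, lhs))
    return known

# Degenerate matched pair: with rule M -> open0 close0,
# u --open0--> v --close0--> w makes w M-reachable from u.
g = {("u", "v", "open0"), ("v", "w", "close0")}
assert ("u", "w", "M") in cfl_reach(g, [("M", ("open0", "close0"))], [])
```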
- The paper explains that the
interprocedural data-flow analysis algorithm described above takes time
N^3 * D^3, where N is the number of nodes in the standard CFG and D is the
size of the set of data-flow facts. However, the paper also explains
that the analysis can be done in time E * D^3, where E is the number of
edges in the supergraph. Not knowing that algorithm, it's not clear to
me whether it's faster because (a) it solves CFR faster using
optimizations valid for balanced parentheses grammars, (b) it constructs
a different graph, or (c) it's just completely different and doesn't
particularly use CFR.
- Interprocedural dataflow analysis could be used to enforce a system of
simple, user-provided annotations specifying the temporal safety
properties of library calls, for example locking protocols or
get_range()/set_range() à la lightweight recoverable virtual memory. The
existence of a general framework for computing interprocedural dataflow
analyses would ease the implementation of such a system.
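A minimal sketch of what such an annotation might look like, assuming the protocol is given as a small typestate automaton over library calls. All names here (UNLOCKED, LOCKED, ERROR, check_path) are hypothetical, not from the paper:

```python
# A locking protocol as a finite-state automaton; a dataflow framework
# would track which states are reachable at each program point.

UNLOCKED, LOCKED, ERROR = "unlocked", "locked", "error"

TRANSITIONS = {
    (UNLOCKED, "lock"): LOCKED,
    (LOCKED, "unlock"): UNLOCKED,
    (LOCKED, "lock"): ERROR,      # double acquire
    (UNLOCKED, "unlock"): ERROR,  # release without hold
}

def step(state, call):
    # Calls not mentioned in the protocol leave the state unchanged.
    return TRANSITIONS.get((state, call), state)

def check_path(calls, start=UNLOCKED):
    """Run the automaton over one sequence of library calls (one CFG path)."""
    state = start
    for c in calls:
        state = step(state, c)
    return state

assert check_path(["lock", "unlock"]) == UNLOCKED
assert check_path(["lock", "lock"]) == ERROR
```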
- There is a large class of
dataflow analysis problems that are not suited to this framework.
Analyses that model how a program computes values generally do not have
distributive dataflow functions. The D^3 term in the running time
can be large in practice.
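The classic example is constant propagation: the transfer function for an assignment like z = x + y is not distributive over the join, as this small sketch shows (the lattice encoding here is my own):

```python
# Applying the transfer function after joining two environments loses
# information relative to joining the two transferred environments,
# i.e. f(a ⊔ b) is strictly less precise than f(a) ⊔ f(b).

TOP = "unknown"  # not a constant

def join_val(a, b):
    return a if a == b else TOP

def join_env(e1, e2):
    return {v: join_val(e1[v], e2[v]) for v in e1}

def transfer(env):
    # transfer function for the statement: z = x + y
    out = dict(env)
    if env["x"] != TOP and env["y"] != TOP:
        out["z"] = env["x"] + env["y"]
    else:
        out["z"] = TOP
    return out

e1 = {"x": 1, "y": 2, "z": TOP}  # environment on path 1
e2 = {"x": 2, "y": 1, "z": TOP}  # environment on path 2

precise = join_env(transfer(e1), transfer(e2))  # z = 3 on both paths
merged = transfer(join_env(e1, e2))             # x, y unknown, so z unknown

assert precise["z"] == 3
assert merged["z"] == TOP  # distributivity fails
```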
- The techniques don't appear to be
useful for languages with non-trivial control flow or dynamic binding. It may
not be possible to construct a graph G for these languages, and even for
imperative languages it's not obvious how function pointers can be
accommodated. This is not, however, a limitation specific to this approach.
- Although CFL-reachability has O(n^3) complexity in the
general case, certain analysis problems can be solved faster due to the way
the graph is constructed. This is the real contribution of the paper.
- The cubic running time of the general-case CFL-reachability problem is a
bit problematic. The paper mentions that some problems are subsets of
this problem, and therefore asymptotically easier. A discussion of
language features that are the root of the algorithms' complexity would
have been helpful.
For instance, if a program does not make use of recursion, or can be
translated into a program that does not make use of recursion, then it
would be possible to 'inline' each function invocation. In this case,
linear algorithms for graph reachability could replace the cubic
algorithms mentioned in the paper with no loss of precision or
soundness. However, such a technique would significantly increase the
size of the program being analyzed. Even if this simple optimization
is not worthwhile, are there other program properties that could be
exploited?
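To make the non-recursive case concrete: once every call is inlined, the matched call/return structure disappears and the problem reduces to ordinary graph reachability, solvable in O(V + E) time by depth-first search. A sketch of that degenerate case, not from the paper:

```python
def reachable(adj, start):
    """Plain DFS reachability over an adjacency-list graph; linear in
    the number of nodes and edges, no grammar involved."""
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for m in adj.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

# A fully inlined, non-recursive control-flow graph:
adj = {"entry": ["a"], "a": ["b", "c"], "b": ["exit"], "c": []}
assert reachable(adj, "entry") == {"entry", "a", "b", "c", "exit"}
```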