Figure-ground organization is a step of perceptual organization which
assigns a contour to one of the two abutting regions. Commonly thought to
follow region segmentation, it is an essential step in forming our perception
of surfaces, shapes and objects, as demonstrated by the pictures in Figure
1. These pictures are highly ambiguous and we may perceive either side as
the figure and "see" its shape. We always perceive the ground side as being
shapeless and extending behind the figure, never seeing both shapes at once.
Figure 1: classical illusions in figure/ground organization.
In computer vision, figure-ground organization has received little attention, and
virtually no effort has been made to understand its roles and implications.
Perhaps this is for a good reason: figure-ground organization is a difficult
mid-level vision problem, and, in the context of complex natural scenes, we do
not know whether figure-ground organization is even possible bottom-up.
This leads to the belief that figure-ground organization is merely the result of a
top-down process, after we recognize and understand object layout in a scene.
Recently a large figure-ground
dataset of natural images has been collected and labeled by human
subjects. Such large-scale groundtruth data enable us to quantitatively study
the figure-ground problem in natural images. In this work we develop a
bottom-up figure-ground approach that combines local cues (e.g. convexity) and
global consistency (e.g. T-junction analysis).
We show that there is rich figure-ground information available at
mid-level, both locally and globally. Quantitatively our approach produces
promising figure-ground labelings without recognizing objects or estimating
depth. Hence we "prove" that bottom-up figure-ground organization is feasible.
Such mid-level processing holds great potential for scene understanding and
Local Figure-Ground Cues
The classical Gestalt theory on figure-ground lists a number of "principles" or
cues, such as convexity, parallelism, size or symmetry. Many of these cues may be
defined locally, without requiring a full segmentation. It is, however, not
easy to translate these intuitive principles into mathematical definitions.
In this work, we take a learning approach to local figure-ground, by grouping
local shapes into shapemes and collecting
figure-ground statistics for these shape clusters.
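The idea of clustering local shapes and collecting figure-ground statistics per cluster can be sketched as follows. This is only an illustration under stated assumptions: the 2-D toy descriptors stand in for real local shape descriptors, plain k-means stands in for whatever clustering is actually used, and the names `kmeans` and `figure_left_stats` are hypothetical.

```python
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the cluster center closest to `point`."""
    return min(range(len(centers)), key=lambda k: sq_dist(point, centers[k]))

def kmeans(points, init_centers, n_iter=20):
    """Plain Lloyd iterations; an assumed stand-in for the real clustering."""
    centers = [list(c) for c in init_centers]
    for _ in range(n_iter):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each non-empty cluster center to the mean of its members.
        for k, members in groups.items():
            centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def figure_left_stats(points, figure_is_left, centers):
    """Per cluster, the fraction of samples whose figure side is on the left."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [left count, total]
    for p, is_left in zip(points, figure_is_left):
        k = nearest(p, centers)
        counts[k][0] += int(is_left)
        counts[k][1] += 1
    return {k: n_left / total for k, (n_left, total) in counts.items()}

# Toy data (assumed): 2-D "descriptors" forming two well-separated clusters,
# each paired with a human label saying whether the figure is on the left.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
figure_is_left = [True, True, True, False, True, False, True, False]

centers = kmeans(points, init_centers=[points[0], points[4]])
stats = figure_left_stats(points, figure_is_left, centers)
print(stats)
```

A cluster whose fraction is far from 0.5 carries a strong local figure-ground cue, while a fraction near 0.5 carries almost none.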
Figure 2: shapemes, or local shape clusters, learned from data using the Geometric
Blur descriptor. A simple clustering of local shapes reveals
interesting structures such as convexity, parallelism as well as straight
lines, corners and line endings.
Figure 3: shapemes encode rich figure-ground information. Here are some
shapemes and their figure-ground statistics: we align each shapeme so that the
contour orientation at the center is vertical, and count the percentage of cases
in which the figure side is to the left. As Gestalt theories predict, parallelism
is a strong figure-ground cue: for shapeme 1, the figure lies between the two
parallel lines, hence to the left of the center. Convexity is also a strong cue
(shapeme 2), whereas a straight line gives no information (shapeme 4).
Global Figure-Ground Consistency
Our local model of figure-ground (probabilistically) assigns labels to each
contour. For any valid labeling, when contours join at a junction, their
figure-ground labels need to be consistent with one another, forming
We use a conditional random field to enforce such consistencies at junctions.
Specifically, we enumerate all possible junction labelings, and learn the
weights of each junction type from data. Figure 4 shows examples of some
"likely" junctions and "unlikely" junctions and the learned weights.
Figure 4: learning valid and invalid junctions. Junction 1: continuation of a
contour, "likely". Junction 2: reversal of figure-ground labeling, "unlikely".
Junction 3: cyclic labeling at a 3-junction, "likely". Junction 4: classical T-junction, "likely".
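To make the role of junction weights concrete, here is a small sketch of combining local label probabilities with junction-type weights and choosing the best joint labeling. Everything specific in it is an assumption for illustration: the function name `map_labeling`, the toy weights, the convention that the segments are oriented consistently along one contour, and the brute-force search, which is practical only for a handful of contours and is not claimed to be the inference procedure used in this work.

```python
import math
from itertools import product

def map_labeling(local_prob_left, junctions, weights):
    """Exhaustively search for the highest-scoring joint labeling.

    local_prob_left[i]: local probability (strictly between 0 and 1) that
        the figure lies to the left of contour i.
    junctions: tuples of contour indices that meet at a junction.
    weights: maps a tuple of labels at a junction to a weight
        (hypothetical values here; in the text they are learned from data).
    """
    n = len(local_prob_left)
    best, best_score = None, -math.inf
    for labels in product((0, 1), repeat=n):  # 1 = figure on the left
        score = 0.0
        # Local term: log-probability of each contour's label.
        for p, lab in zip(local_prob_left, labels):
            score += math.log(p if lab == 1 else 1.0 - p)
        # Junction term: weight of the label pattern at each junction.
        for members in junctions:
            score += weights.get(tuple(labels[i] for i in members), 0.0)
        if score > best_score:
            best, best_score = labels, score
    return best, best_score

# Toy example (assumed): three segments along one contour, joined by two
# continuation junctions. Keeping the label is rewarded, a reversal penalized.
local_prob_left = [0.9, 0.4, 0.8]
junctions = [(0, 1), (1, 2)]
weights = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): -1.0, (1, 0): -1.0}

local_only = tuple(int(p > 0.5) for p in local_prob_left)
best, best_score = map_labeling(local_prob_left, junctions, weights)
print(local_only, best)
```

In this toy case the weakly mislabeled middle segment is flipped to agree with its neighbors, which is the kind of correction junction consistency is meant to provide.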
Quantitative performance evaluation
We evaluate the performance of our approach using human-marked figure-ground
labels. We consider three labelings: (1) local cues only; (2) local cues,
averaged over contour segments; (3) local cues plus global consistency.
We observe that local cues perform quite well, even though the natural scenes
in the dataset are fairly complex. If we have a "perfect" segmentation (such as
one marked by human subjects), we may perform global inference on a "perfect"
junction graph. This global consistency inference greatly improves
|Labeling accuracy with groundtruth segmentation
||Local averaged on Contours
||Local + Global
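The difference between labelings (1) and (2) can be illustrated with a short sketch: local probabilities are averaged over all points of the same contour segment before thresholding, and accuracy is the fraction of points whose predicted figure side matches the human label. The function names and the toy numbers are assumptions for illustration, not values from the dataset.

```python
def average_over_contours(point_probs, contour_ids):
    """Replace each point's probability by the mean over its contour segment."""
    sums, counts = {}, {}
    for p, c in zip(point_probs, contour_ids):
        sums[c] = sums.get(c, 0.0) + p
        counts[c] = counts.get(c, 0) + 1
    return [sums[c] / counts[c] for c in contour_ids]

def accuracy(probs, truth_left):
    """Fraction of points whose thresholded prediction matches the label."""
    correct = sum((p > 0.5) == t for p, t in zip(probs, truth_left))
    return correct / len(truth_left)

# Toy data (assumed): eight contour points on two contour segments, with the
# local probability that the figure is on the left and the human label.
point_probs = [0.9, 0.8, 0.4, 0.7, 0.2, 0.3, 0.6, 0.1]
contour_ids = [0, 0, 0, 0, 1, 1, 1, 1]
truth_left = [True, True, True, True, False, False, False, False]

averaged = average_over_contours(point_probs, contour_ids)
acc_local = accuracy(point_probs, truth_left)
acc_averaged = accuracy(averaged, truth_left)
print(acc_local, acc_averaged)
```

Averaging lets the confident points on a segment outvote isolated local errors, at the cost of assuming that the whole segment shares one figure side.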
On the other hand, we may apply our approach when there is "no" segmentation.
In this case we compute edges bottom-up and form junction structures by
tracing edges. These junction structures are much noisier, hence the benefit of
global inference is much smaller. Nevertheless, we still observe a significant
increase in accuracy.
|Labeling accuracy with bottom-up boundary detection
||Local averaged on Contours
||Local + Global
Sample Results with Groundtruth Segmentation
What we can do with figure-ground labeling if we have a "perfect" segmentation. Column
1: images. Column 2: groundtruth figure-ground labels, white being the figure
side and black the ground side. Column 3: results from local cues, red
indicating correct labelings and blue incorrect. Column 4: results from
Sample Results with Bottom-up Boundary Detection
What we can do with figure-ground labeling with "no" segmentation. In this case,
junction structures are derived from bottom-up boundary detection (Column 2).
Local figure-ground cues perform as effectively as before (Column 3). Global
inference about T-junctions is much harder without perfect junctions, but we
still see a significant improvement (Column 4).