Figure/ground organization, the binding of contours to surfaces, is a classical
problem in vision. In this work we study a simplified task of figure/ground
labeling in which the goal is to label every pixel as belonging to either a
figural object or background. Our goal is to understand the role of different
cues in this process, including low-level cues, such as edge contrast and
texture similarity; mid-level cues, such as curvilinear continuity; and
high-level cues, such as characteristic shape or texture of the object.
We develop a conditional random field model over edges, regions and objects to
integrate these cues. This random field model is built upon the constrained
Delaunay triangulation (CDT) graph, a discrete scale-invariant image
representation we have recently developed. We
train the model from human-marked groundtruth labels and quantify the relative
contributions of each cue on a large collection of horse images.
We have previously applied this CDT/CRF framework to the problem of contour
grouping/completion. Here we extend the approach in a few key directions:
- We extend the framework to joint modeling and inference of both contours
and regions, hence allowing a much richer set of cues to be incorporated and
studied. The output of our model is now the posterior marginal distributions of
both boundary contours and figure regions.
- We incorporate high-level knowledge into the grouping mechanism, including
shape, texture and regional support. As we will show, such object-specific
knowledge greatly improves grouping performance.
Cues for Figure/Ground Labeling
We study the interactions of figure/ground cues at three distinct levels:
low-level cues, which can be computed in local neighborhoods; mid-level cues,
which encode generic relations between elements without object knowledge; and
high-level cues, which are specific to an object category.
We define these cues on top of the CDT graph, where each edge e in the
triangulation is associated with a binary random variable Xe, and each
triangle t with a binary random variable Yt. Each cue imposes a constraint
on a subset of these random variables, as outlined below.
L1: edge energy along an edge e.
L2: brightness/texture similarity between two regions s and t.
M1: collinearity and junction frequency at vertex V.
M2: consistency of edge labels and adjoining region labels.
H1: similarity of a region t to exemplar texture.
H2: compatibility of local region support with pose.
H3: compatibility of local edge shape with pose.
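To make the integration concrete, the following Python sketch scores one joint edge/region labeling on the CDT graph. All names and weights are hypothetical; the paper's actual potentials are learned from groundtruth. The pairwise term illustrates cue M2, which couples each edge label Xe to the labels Yt of its two adjoining triangles:

```python
import numpy as np

def crf_energy(x_edges, y_tris, edge_feats, tri_feats, adjacency, w):
    """Energy of one joint edge/region labeling (illustrative sketch).

    x_edges   : 0/1 label X_e per triangulation edge (boundary vs. not)
    y_tris    : 0/1 label Y_t per triangle (figure vs. ground)
    edge_feats: per-edge cue response, e.g. edge energy (cue L1)
    tri_feats : per-triangle cue response, e.g. texture score (cue H1)
    adjacency : (edge, left_triangle, right_triangle) index triples
    w         : hypothetical cue weights
    """
    # Unary potentials: cues that look at a single edge or triangle
    energy = w["edge"] * float(np.dot(edge_feats, x_edges))
    energy += w["tri"] * float(np.dot(tri_feats, y_tris))
    # Pairwise potential (cue M2): an edge should be "on" exactly when
    # its two adjoining triangles take different figure/ground labels
    for e, lt, rt in adjacency:
        disagree = int(y_tris[lt] != y_tris[rt])
        energy += w["consist"] * int(x_edges[e] != disagree)
    return energy
```

A labeling where boundary edges separate figure from ground triangles incurs no consistency penalty; dropping such an edge while keeping the region labels raises the energy.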
Performance evaluation on the horse dataset: (a) precision-recall
curves for horse boundaries, models with low-level cues only (Pb), low-
plus mid-level cues (Pb+M), low- plus high-level cues (Pb+H), and
all three classes of cues combined (Pb+M+H). The F-measure
recorded in the legend is the maximal harmonic mean of precision and recall and
provides an overall ranking. Using high-level cues greatly improves the
boundary detection performance. Mid-level continuity cues are useful with or
without high-level cues. (b) precision-recall for regions. The poor performance
of the baseline L+M model indicates the ambiguity of figure/ground labeling
at the low level despite successful boundary detection. High-level shape
knowledge is the key, consistent with evidence from psychophysics [Peterson and
Gibson 1994]. In both boundary and region cases, the groundtruth labels on CDTs
are nearly perfect, indicating that the CDT graphs preserve most of the image
structure.
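The F-measure used for ranking in the legend is the maximal harmonic mean of precision and recall along the curve; a minimal sketch of that summary statistic:

```python
def max_f_measure(precisions, recalls):
    """Maximal harmonic mean of precision and recall over a PR curve,
    the single-number summary used to rank models in the legend."""
    best = 0.0
    for p, r in zip(precisions, recalls):
        if p + r > 0.0:
            # Harmonic mean 2pr/(p+r) at this operating point
            best = max(best, 2.0 * p * r / (p + r))
    return best
```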
- Cue Integration in Figure/Ground Labeling.
Xiaofeng Ren, Charless Fowlkes and Jitendra Malik, in NIPS '05, Vancouver 2005.