Region segmentation is a key problem in mid-level vision and has received much attention in recent years. In this work, we try to answer the following question:
What is a good segmentation?
Gestalt psychologists have long revealed to us various grouping cues, such as
proximity, similarity, and continuity. Computationally, however, we do not yet have a
satisfying answer. Most existing segmentation algorithms, such as
Normalized Cuts, use hand-constructed criteria and/or hand-picked parameters.
It is our belief that a good answer to
this question lies in the use of human-marked groundtruth data, on which
segmentation algorithms can be rigorously trained and tested.
If we look closely at the classical Gestalt laws of grouping, we find that
most of them are discriminative in nature. For example, when Wertheimer discussed his law of proximity, he showed a row of dots and asked which is the better grouping, ab/cd/ef or a/bc/de/fg. Proximity tells us that the former is preferable.
This inspires us to develop a discriminative model of segmentation. Figure
1 shows an example, where "good" segmentations are given by human
subjects, and "bad" segmentations are from random matching of images and masks.
We will use Gestalt grouping cues as features and train a classifier to
distinguish "good" from "bad".
Figure 1: We formulate segmentation as classification between good
segmentations (b), given by human subjects, and bad segmentations (c),
random matchings of images and masks.
Cues for Grouping
We preprocess images into superpixel maps with 200
superpixels per image. With so few superpixels, we are almost free to
define any feature we like, since we can afford to compute and search over them.
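The paper obtains superpixels from a Normalized-Cuts-style oversegmentation; as a hedged stand-in with the same interface (an integer label map with roughly 200 regions), a simple grid partition can be sketched in a few lines:

```python
import numpy as np

def grid_superpixels(h, w, n_target=200):
    """Partition an h x w image into roughly n_target rectangular
    superpixels -- a simplified stand-in for the Normalized-Cuts-based
    oversegmentation used in the paper.  Returns an integer label map
    of shape (h, w)."""
    # Choose a grid whose cell count is close to n_target,
    # keeping cells roughly square.
    rows = max(1, int(round(np.sqrt(n_target * h / w))))
    cols = max(1, int(round(n_target / rows)))
    row_idx = np.minimum(np.arange(h) * rows // h, rows - 1)
    col_idx = np.minimum(np.arange(w) * cols // w, cols - 1)
    return row_idx[:, None] * cols + col_idx[None, :]

labels = grid_superpixels(240, 320, n_target=200)
n_superpixels = labels.max() + 1
```

Any oversegmentation with the same label-map interface (e.g. one that respects image contours, as in the paper) can be dropped in here.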
We consider a "good" segment S in a segmentation, and define the following features for S:
- inter-region contour energy, Eext:
how strong the contour contrast is along the boundary of a segment.
- intra-region contour energy, Eint:
how strong the contour contrast is in the interior of a segment.
- inter-region brightness (dis)similarity, Bext:
how similar in brightness a segment is to surrounding segments.
- intra-region brightness similarity, Bint:
how consistent in brightness a segment is in its interior.
- inter-region texture (dis)similarity, Text:
how similar in texture a segment is to surrounding segments.
- intra-region texture similarity, Tint:
how consistent in texture a segment is in its interior.
- curvilinear continuity, C:
how smooth the boundary of a segment is.
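As an illustration of the intra/inter distinction above, here is a minimal sketch of the two brightness cues for a single segment. The actual features in the paper are defined over superpixels; the definitions below (negative variance for the intra cue, difference of means for the inter cue) are simplified stand-ins, not the paper's exact formulas:

```python
import numpy as np

def brightness_features(image, mask):
    """Sketch of the two brightness cues for one segment S:
      b_int: consistency of brightness inside S (negative variance,
             so that higher means more consistent);
      b_ext: dissimilarity of S to its surround (difference of means).
    `image` is a 2-D grayscale array, `mask` a boolean array marking S."""
    inside = image[mask]
    outside = image[~mask]
    b_int = -float(np.var(inside))                       # high = uniform interior
    b_ext = abs(float(inside.mean() - outside.mean()))   # high = distinct from surround
    return b_int, b_ext

# Toy example: a bright square on a dark background should score
# well on both cues.
img = np.zeros((20, 20)); img[5:15, 5:15] = 1.0
mask = np.zeros((20, 20), dtype=bool); mask[5:15, 5:15] = True
b_int, b_ext = brightness_features(img, mask)
```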
How useful are these features? To quantify this, we can measure the mutual
information between a feature f and the target binary label h
(where h=1 if S is from a good segmentation, 0 otherwise):
[Table: mutual information between each feature and h, with inter- and intra-region columns for the contour, brightness, and texture cues.]
We find that: (1) inter-region features are much more informative than
intra-region features, suggesting that discriminative grouping cues are more
useful than generative cues; (2) contour features are the most informative,
followed by texture and brightness; and (3) continuity is quite a useful cue by
itself. Of course, these are only marginal information measures; we will look at
cue combination in the next section.
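The marginal informativeness of a cue can be estimated with histogram-based mutual information. A minimal sketch follows; the binning scheme and the synthetic data are our own illustrative choices, not the paper's:

```python
import numpy as np

def mutual_information(f, h, n_bins=16):
    """Estimate I(f; h) in bits between a continuous feature f and a
    binary label h by discretizing f into equal-width bins."""
    f = np.asarray(f); h = np.asarray(h).astype(int)
    edges = np.linspace(f.min(), f.max(), n_bins + 1)
    fb = np.clip(np.digitize(f, edges) - 1, 0, n_bins - 1)
    joint = np.zeros((n_bins, 2))
    for b, lab in zip(fb, h):
        joint[b, lab] += 1
    joint /= joint.sum()                       # empirical joint p(f, h)
    pf = joint.sum(axis=1, keepdims=True)      # marginal p(f)
    ph = joint.sum(axis=0, keepdims=True)      # marginal p(h)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pf @ ph)[nz])).sum())

# Synthetic check: a feature correlated with the label should carry
# far more information than an independent one.
rng = np.random.default_rng(0)
h = rng.integers(0, 2, 5000)
informative = h + 0.3 * rng.standard_normal(5000)
noise = rng.standard_normal(5000)
mi_signal = mutual_information(informative, h)
mi_noise = mutual_information(noise, h)
```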
Learning Cue Combination
How to combine these Gestalt cues? First we want to develop some intuition
about the interactions between the features. As we are dealing with a
classification problem, one way to study the data is to plot both the positive
and negative examples in the feature space, and look for a good classification
boundary between them. Figure 2 shows some examples, where we look at
pairs of features, and the distributions are shown as iso-probability contours.
These distributions suggest that:
- the features are relatively well-behaved; for both classes, a Gaussian model would be a reasonable approximation.
- a linear classifier would perform well.
Therefore we define our objective function, i.e. the "goodness" of a segment S, as a linear combination of the features:
G(S) = ∑j cj fj(S)
where the weights cj are learned from logistic regression.
Quantitative evaluations show that such a linear combination captures most of
the information available in these features.
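The learning step can be sketched as plain gradient-ascent logistic regression on per-segment feature vectors; the toy data below is hypothetical and stands in for the Gestalt features:

```python
import numpy as np

def fit_logistic(F, h, lr=0.5, n_iter=2000):
    """Fit weights c for G(S) = sum_j c_j f_j(S) by logistic regression,
    here via plain gradient ascent on the log-likelihood (a sketch of
    the learning step; any logistic-regression solver would do).
    F: (n, d) feature matrix, h: (n,) binary labels."""
    n, d = F.shape
    c = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-F @ c))    # P(h = 1 | features)
        c += lr * F.T @ (h - p) / n         # log-likelihood gradient
    return c

# Toy data: the label depends only on the first feature, so the
# learned weight vector should concentrate on it.
rng = np.random.default_rng(1)
F = rng.standard_normal((400, 3))
h = (F[:, 0] > 0).astype(float)
c = fit_logistic(F, h)
accuracy = ((F @ c > 0).astype(float) == h).mean()
```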
Figure 2: iso-probability contour plots of the empirical distributions for pairs of features.
To score the "goodness" of an entire segmentation, we
simply sum G(S) over all its segments S (i.e., we assume the segments are independent).
To actually find good segmentations for a novel image, we use
simulated annealing to optimize the linear objective function.
The results are obtained by combining the linear objective function above with a Gaussian prior on the size of segments (also estimated from the groundtruth).
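The search step can be sketched as simulated annealing over the assignment of superpixels to segments, maximizing a scalar score that stands in for the summed G plus the size prior. The move set, cooling schedule, and toy objective below are illustrative assumptions, not the paper's exact procedure:

```python
import math
import random

def anneal(labels, score, n_labels, steps=20000, t0=1.0, t1=0.01):
    """Simulated annealing over superpixel-to-segment assignments.
    `labels[i]` is the segment index of superpixel i; `score` maps an
    assignment to a scalar to be maximized (standing in for the summed
    G over segments plus the size prior)."""
    random.seed(0)
    cur, cur_s = list(labels), score(labels)
    best, best_s = list(cur), cur_s
    for step in range(steps):
        t = t0 * (t1 / t0) ** (step / steps)   # geometric cooling
        i = random.randrange(len(cur))
        old = cur[i]
        cur[i] = random.randrange(n_labels)    # propose a reassignment
        new_s = score(cur)
        # Accept improvements always; worse moves with Boltzmann probability.
        if new_s >= cur_s or random.random() < math.exp((new_s - cur_s) / t):
            cur_s = new_s
            if cur_s > best_s:
                best, best_s = list(cur), cur_s
        else:
            cur[i] = old                       # reject: undo the move
    return best, best_s

# Toy objective: reward superpixels for matching a target assignment.
target = [i % 3 for i in range(30)]
score = lambda a: sum(x == y for x, y in zip(a, target))
start = [0] * 30
best, best_s = anneal(start, score, n_labels=3)
```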
- Learning a Classification Model for Segmentation.
Xiaofeng Ren and Jitendra Malik. In Proc. ICCV '03, volume 1, pages 10-17, Nice, 2003.