For some applications it isn't enough to localize an object with a simple bounding box. For instance, you might want to segment an object region once it is detected. This class of problems is called instance segmentation.
Instance segmentation is an extension of object detection, where a binary mask
(i.e. object vs. background) is associated with every bounding box. This allows
for more fine-grained information about the extent of the object within the box.
To train an instance segmentation model, a groundtruth mask must be supplied for
every groundtruth bounding box. In addition to the proto fields listed in the
section titled Using your own dataset, one must also supply image/object/mask,
which can either be a repeated list of single-channel encoded PNG strings, or a
single dense 3D binary tensor where masks corresponding to each object are
stacked along the first dimension. Each is described in more detail below.
Instance segmentation masks can be supplied as serialized PNG images.
image/object/mask = ["\x89PNG\r\n\x1A\n\x00\x00\x00\rIHDR\...", ...]
These masks are whole-image masks, one for each object instance. The spatial dimensions of each mask must agree with the image. Each mask has only a single channel, and the pixel values are either 0 (background) or 1 (object mask). PNG masks are the preferred parameterization since they offer considerable space savings compared to dense numerical masks.
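As a concrete illustration, here is a minimal sketch (assuming NumPy, Pillow, and TensorFlow, with hypothetical image dimensions and object regions) of how binary masks can be encoded as single-channel PNG strings for the image/object/mask feature:

```python
import io

import numpy as np
import tensorflow as tf
from PIL import Image

def encode_png_masks(masks):
  """Encodes a [num_boxes, H, W] uint8 array of 0/1 masks as PNG strings."""
  encoded = []
  for mask in masks:
    buffer = io.BytesIO()
    # Single-channel ('L' mode) image; pixel values stay 0 or 1.
    Image.fromarray(mask, mode='L').save(buffer, format='PNG')
    encoded.append(buffer.getvalue())
  return encoded

# Two hypothetical instance masks for a 480x640 image.
masks = np.zeros((2, 480, 640), dtype=np.uint8)
masks[0, 100:200, 150:300] = 1  # first object's region
masks[1, 250:400, 300:500] = 1  # second object's region

mask_feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=encode_png_masks(masks)))
```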
Masks can also be specified via a dense numerical tensor.
image/object/mask = [0.0, 0.0, 1.0, 1.0, 0.0, ...]
For an image with dimensions H x W and num_boxes groundtruth boxes, the mask
corresponds to a [num_boxes, H, W] float32 tensor, flattened into a single
vector of shape num_boxes * H * W. In TensorFlow, examples are read in
row-major format, so the elements are organized as:
... mask 0 row 0 ... mask 0 row 1 ... // ... mask 0 row H-1 ... mask 1 row 0 ...
where each row has W contiguous binary values.
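A corresponding sketch for the dense parameterization, again with hypothetical shapes, builds this flattened float list; NumPy's default row-major (C) ordering matches the layout described above:

```python
import numpy as np
import tensorflow as tf

num_boxes, height, width = 2, 480, 640
masks = np.zeros((num_boxes, height, width), dtype=np.float32)
masks[0, 100:200, 150:300] = 1.0  # first object's region
masks[1, 250:400, 300:500] = 1.0  # second object's region

# flatten() uses row-major (C) order: mask 0 row 0, mask 0 row 1, ...
mask_feature = tf.train.Feature(
    float_list=tf.train.FloatList(value=masks.flatten().tolist()))
```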
To see example tf-records with mask labels, see the examples under the Preparing Inputs section.
We provide four instance segmentation config files that you can use to train your own models; for more details, see the detection model zoo.
Currently, the only supported instance segmentation model is Mask R-CNN, which requires Faster R-CNN as the backbone object detector.
Once you have a baseline Faster R-CNN pipeline configuration, you can make the following modifications in order to convert it into a Mask R-CNN model.
1. Within train_input_reader and eval_input_reader, set load_instance_masks to True. If using PNG masks, set mask_type to PNG_MASKS; otherwise you can leave it as the default NUMERICAL_MASKS.
2. Within the faster_rcnn config, use a MaskRCNNBoxPredictor as the second_stage_box_predictor.
3. Within the MaskRCNNBoxPredictor message, set predict_instance_masks to True. You must also define conv_hyperparams.
4. Within the faster_rcnn message, set number_of_stages to 3.
5. Within eval_config, set metrics_set to 'coco_mask_metrics'.
6. Update the input_paths to point at your data.

These modifications are sketched in the config fragments below. Please refer to the section on Running the pets dataset for additional details.
Note: The mask prediction branch consists of a sequence of convolution layers. You can set the number of convolution layers and their depth as follows:
- Within the MaskRCNNBoxPredictor message, set mask_prediction_conv_depth to your value of interest. The default value is 256. If you set it to 0 (recommended), the depth is computed automatically based on the number of classes in the dataset.
- Within the MaskRCNNBoxPredictor message, set mask_prediction_num_conv_layers to your value of interest. The default value is 2.
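For instance, a hypothetical mask head with four automatically-sized convolution layers would look like this inside the second_stage_box_predictor:

```
mask_rcnn_box_predictor {
  predict_instance_masks: true
  mask_prediction_conv_depth: 0       # 0 (recommended): depth inferred from class count
  mask_prediction_num_conv_layers: 4  # hypothetical value; the default is 2
}
```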