Converting Faster R-CNN models, created with TensorFlow Object Detection API

This page is deprecated. Consider reading this page instead, which describes the new approach to converting Object Detection API models and gives inference results closer to those of TensorFlow.

Converting models created with TensorFlow Object Detection API version equal or higher than 1.6.0

This chapter describes how to convert selected Faster R-CNN models from the TensorFlow Object Detection API zoo, version 1.6.0 or higher. The full list of supported models is provided in the table below. Note that only batch size 1 is currently supported. The CPU plugin is the only Inference Engine plugin that supports inference of these topologies.

The Faster R-CNN models contain several building blocks similar to those of SSD models, so it is highly recommended to read the chapter about enabling TensorFlow Object Detection API SSD models first. Detailed information about Faster R-CNN topologies is provided here.

The TensorFlow network consists of a number of big blocks grouped by scope:

  • Preprocessor performs scaling/resizing of the image and converts input data to [0, 1] interval. Has two outputs: the first one is modified input image and the second one is a constant tensor with shape (batch_size, 3) and values (resized_image_height, resized_image_width, 3).
  • FirstStageFeatureExtractor is a backbone feature extractor.
  • FirstStageBoxPredictor calculates boxes and classes predictions.
  • GridAnchorGenerator generates anchors coordinates.
  • ClipToWindow crops anchors to the resized image size.
  • Decode decodes coordinates of boxes using anchors and data from the FirstStageBoxPredictor.
  • BatchMultiClassNonMaxSuppression performs non maximum suppression.
  • map scales coordinates of boxes to [0, 1] interval by dividing coordinates by (resized_image_height, resized_image_width).
  • map_1 scales coordinates from [0, 1] interval to resized image sizes.
  • SecondStageFeatureExtractor is a feature extractor for predicted Regions of interest (ROIs).
  • SecondStageBoxPredictor refines box coordinates according to the SecondStageFeatureExtractor output.
  • SecondStagePostprocessor is Detection Output layer performing final boxes predictions.

Sub-graph replacements

There are three sub-graph replacements defined in the extensions/front/tf/legacy_faster_rcnn_support.json used to convert these models:

  • the first one replaces the Preprocessor block. The implementation of this replacer is in <INSTALL_DIR>/deployment_tools/model_optimizer/extensions/front/tf/Preprocessor.py.
  • the second one replaces a number of blocks in the graph, including GridAnchorGenerator, ClipToWindow, Decode, BatchMultiClassNonMaxSuppression, Tile, Tile_1 and map, with Proposal and ROIPooling layers and some additional layers that pre-process input data.
  • the third one replaces SecondStagePostprocessor with a DetectionOutput layer.

The second replacer is defined using the following configuration that matches sub-graph by points:

{
"custom_attributes": {
"nms_threshold": 0.7,
"feat_stride": 16,
"max_proposals": 100,
"anchor_base_size": 256,
"anchor_scales": [0.25, 0.5, 1.0, 2.0],
"anchor_aspect_ratios": [0.5, 1.0, 2.0],
"roi_spatial_scale": 0.0625
},
"id": "TFObjectDetectionAPIFasterRCNNProposalAndROIPooling",
"include_inputs_to_sub_graph": true,
"include_outputs_to_sub_graph": true,
"instances": {
"end_points": [
"CropAndResize",
"map_1/TensorArrayStack/TensorArrayGatherV3",
"map_1/while/strided_slice/Enter",
"BatchMultiClassNonMaxSuppression/map/TensorArrayStack_4/TensorArrayGatherV3"
],
"start_points": [
"FirstStageBoxPredictor/concat",
"FirstStageBoxPredictor/concat_1",
"GridAnchorGenerator/Identity",
"Shape",
"CropAndResize"
]
},
"match_kind": "points"
}

The start_points list contains the following nodes:

  • FirstStageBoxPredictor/concat node produces box coordinates predictions.
  • FirstStageBoxPredictor/concat_1 node produces class predictions, which are used for the ROIs.
  • GridAnchorGenerator/Identity node produces anchors coordinates.
  • Shape and CropAndResize nodes are specified as inputs to correctly isolate the required sub-graph. Refer to the chapter for more information about replacements by points.

The end_points list contains the following nodes:

  • CropAndResize is the node that performs ROI pooling operation.
  • map_1/TensorArrayStack/TensorArrayGatherV3, map_1/while/strided_slice/Enter and BatchMultiClassNonMaxSuppression/map/TensorArrayStack_4/TensorArrayGatherV3 are specified to correctly isolate the sub-graph.

The custom_attributes dictionary contains attributes where most values are taken from the topology-specific configuration file samples/configs/faster_rcnn_*.config of the TensorFlow Object Detection API repository:

  • nms_threshold is the value of the first_stage_nms_iou_threshold parameter.
  • feat_stride is the value of the height_stride and width_stride parameters. Inference Engine supports only the case when these two values are equal, which is why the replacement configuration file contains just one parameter.
  • max_proposals is the value of the max_total_detections parameter, which is the maximum number of proposal boxes from the Proposal layer and of detected boxes.
  • anchor_base_size is the base size of the generated anchor. The value 256 is the default for this parameter; it is not specified in the configuration file.
  • anchor_scales is the value of the scales attribute.
  • anchor_aspect_ratios is the value of the aspect_ratios attribute.
  • roi_spatial_scale is needed for the Inference Engine ROIPooling layer. It is the default value that is not actually used.

The identifier for this replacer is TFObjectDetectionAPIFasterRCNNProposalAndROIPooling. The Python implementation of this replacer is in the file <INSTALL_DIR>/deployment_tools/model_optimizer/extensions/front/tf/FasterRCNNs.py.

The first four functions of the replacer class are the following:

class TFObjectDetectionAPIFasterRCNNProposalAndROIPooling(FrontReplacementFromConfigFileSubGraph):
    """
    This class replaces sub-graph of operations with Proposal and ROIPooling layers and additional layers transforming
    tensors from layout of TensorFlow to layout required by Inference Engine.
    Refer to comments inside the function for more information about performed actions.
    """
    replacement_id = 'TFObjectDetectionAPIFasterRCNNProposalAndROIPooling'

    def run_after(self):
        return [PreprocessorReplacement]

    def run_before(self):
        return [SecondStagePostprocessorReplacement]

    def output_edges_match(self, graph: nx.DiGraph, match: SubgraphMatch, new_sub_graph: dict):
        return {match.output_node(0)[0].id: new_sub_graph['roi_pooling_node'].id}

    def nodes_to_remove(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
        new_list = match.matched_nodes_names().copy()
        # do not remove nodes that produce box predictions and class predictions
        new_list.remove(match.single_input_node(0)[0].id)
        new_list.remove(match.single_input_node(1)[0].id)
        return new_list

The run_after function returns a list of replacer classes (classes inherited from FrontReplacementOp, FrontReplacementPattern, FrontReplacementFromConfigFileSubGraph and so on) that must be executed before the current sub-graph replacement class. In this case, the replacer must run after the Preprocessor block is removed by the PreprocessorReplacement replacer. Similarly, the run_before function tells Model Optimizer to execute this replacer before SecondStagePostprocessorReplacement.
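The ordering constraints expressed by run_after and run_before can be illustrated with a small stand-alone sketch. It is not the Model Optimizer scheduler: the Replacer base class and the execution_order helper are hypothetical, and the standard graphlib module (Python 3.9 or newer) is used here only to show how the two methods together determine a valid execution order.

```python
from graphlib import TopologicalSorter


class Replacer:
    """Hypothetical base class: by default there are no ordering constraints."""

    def run_after(self):
        return []

    def run_before(self):
        return []


class PreprocessorReplacement(Replacer):
    pass


class SecondStagePostprocessorReplacement(Replacer):
    pass


class ProposalAndROIPooling(Replacer):
    # same constraints as in the replacer class shown above
    def run_after(self):
        return [PreprocessorReplacement]

    def run_before(self):
        return [SecondStagePostprocessorReplacement]


def execution_order(replacer_classes):
    """Return class names in an order that satisfies all constraints."""
    predecessors = {cls: set() for cls in replacer_classes}
    for cls in replacer_classes:
        instance = cls()
        for earlier in instance.run_after():
            predecessors[cls].add(earlier)
        for later in instance.run_before():
            predecessors.setdefault(later, set()).add(cls)
    return [cls.__name__ for cls in TopologicalSorter(predecessors).static_order()]


order = execution_order([SecondStagePostprocessorReplacement,
                         ProposalAndROIPooling,
                         PreprocessorReplacement])
print(order)
```

Whatever order the classes are registered in, the sketch places the Proposal and ROIPooling replacer between the other two.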

The output_edges_match function describes the matching between the output nodes of the sub-graph before and after the replacement. In this case, the only needed output node of the sub-graph is the CropAndResize node, which is identified with match.output_node(0)[0]. The new output node, which is created in the generate_sub_graph function, is identified with new_sub_graph['roi_pooling_node'].

The nodes_to_remove function takes the default list of nodes to be removed, which contains all matched nodes, and removes from it the two input nodes identified with match.single_input_node(0)[0] and match.single_input_node(1)[0]. These nodes will be connected as inputs to the new nodes generated in the generate_sub_graph function, so they should not be removed.

The code generating new sub-graph is the following:

def generate_sub_graph(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
    log.debug('TFObjectDetectionAPIFasterRCNNProposal: matched_nodes = {}'.format(match.matched_nodes_names()))
    config_attrs = match.custom_replacement_desc.custom_attributes
    nms_threshold = config_attrs['nms_threshold']
    feat_stride = config_attrs['feat_stride']
    max_proposals = config_attrs['max_proposals']
    anchor_base_size = config_attrs['anchor_base_size']
    roi_spatial_scale = config_attrs['roi_spatial_scale']
    proposal_ratios = config_attrs['anchor_aspect_ratios']
    proposal_scales = config_attrs['anchor_scales']
    anchors_count = len(proposal_ratios) * len(proposal_scales)

These lines get parameters defined in the sub-graph replacement configuration file and calculate initial anchors count.
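To make the anchors count concrete, the sketch below enumerates one anchor shape per (scale, aspect ratio) pair using the values from the configuration shown earlier. The height and width formulas (aspect ratio treated as width divided by height, with the area fixed by the scale) are an assumption about a common grid anchor convention, not something this page states, and anchor_shapes is a hypothetical helper.

```python
import numpy as np

# values from the "custom_attributes" section of the replacement configuration
anchor_base_size = 256
anchor_scales = [0.25, 0.5, 1.0, 2.0]
anchor_aspect_ratios = [0.5, 1.0, 2.0]


def anchor_shapes(base_size, scales, ratios):
    """Return an array of (height, width) pairs, one per scale/ratio combination.

    Assumption: ratio = width / height and each anchor area is (base_size * scale) ** 2.
    """
    shapes = []
    for scale in scales:
        for ratio in ratios:
            height = base_size * scale / np.sqrt(ratio)
            width = base_size * scale * np.sqrt(ratio)
            shapes.append((height, width))
    return np.array(shapes)


shapes = anchor_shapes(anchor_base_size, anchor_scales, anchor_aspect_ratios)
# same expression as in generate_sub_graph
anchors_count = len(anchor_aspect_ratios) * len(anchor_scales)
print(anchors_count, shapes.shape)
```

With four scales and three aspect ratios there are twelve anchors per feature map location, which is the value later used to reshape the class probabilities tensor.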

# get the ROIPool size from the CropAndResize which performs the same action
if 'CropAndResize' not in graph.nodes():
    raise Error('Failed to find node with name "CropAndResize" in the topology. Probably this is not Faster'
                ' RCNN topology or it is not supported')
roi_pool_size = Node(graph, 'CropAndResize').in_node(3).value[0]

The code above gets the ROI Pooling spatial output dimension size as a value from the fourth argument of the node with name CropAndResize.

# Convolution/matmul node that produces classes predictions
# Permute result of the tensor with classes predictions so it will be in a correct layout for Softmax
predictions_node = match.single_input_node(1)[0].in_node(0).in_node(0)
permute_predictions_op = Permute(graph, {'order': np.array([0, 2, 3, 1])})
permute_predictions_node = permute_predictions_op.create_node([], dict(name=predictions_node.name + '/Permute_'))
insert_node_after(predictions_node, permute_predictions_node, 0)
reshape_classes_op = Reshape(graph, {'dim': np.array([0, -1, 2])})
reshape_classes_node = reshape_classes_op.create_node([permute_predictions_node],
dict(name='Reshape_FirstStageBoxPredictor_Class_'))
update_attrs(reshape_classes_node, 'shape_attrs', 'dim')
softmax_conf_op = Softmax(graph, {'axis': 1})
softmax_conf_node = softmax_conf_op.create_node([reshape_classes_node],
dict(name='FirstStageBoxPredictor_SoftMax_Class_'))

The output with class predictions from the FirstStageBoxPredictor is generated with a convolution operation. The convolution output data layout in TensorFlow is NHWC, while Inference Engine uses the NCHW layout. By default, Model Optimizer converts the weights of TensorFlow convolutions to produce the output tensor in the NCHW layout required by Inference Engine. The issue arises because the class predictions tensor is passed through the Softmax operation to produce class probabilities. The Inference Engine Softmax is performed over the fastest-changing dimension, which is 'W' in Inference Engine. Thus, after conversion of the convolution node producing class predictions, the softmax operation would be performed over the wrong dimension. The solution is to add Permute and Reshape operations to prepare the input data for Softmax. The Reshape operation is required to make the size of the fastest-changing dimension equal to 2, because there are 2 classes being predicted: background and foreground.
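The effect of the layout mismatch can be reproduced with plain NumPy. The sketch below is an illustration only and does not use Model Optimizer operations: the tensor sizes are hypothetical, and softmax is a local helper. It shows that a softmax over the last axis of an NCHW tensor normalizes along W, while permuting back to NHWC order and reshaping to pairs restores the per-anchor (background, foreground) normalization.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# hypothetical sizes: 12 anchors, 2 classes per anchor
n, h, w, anchors = 1, 2, 3, 12
rng = np.random.default_rng(0)

# TensorFlow convolution output: NHWC, channels hold (anchor, class) pairs
logits_nhwc = rng.normal(size=(n, h, w, anchors * 2)).astype(np.float32)
# the same values after conversion to the NCHW layout
logits_nchw = logits_nhwc.transpose(0, 3, 1, 2)

# reference result: softmax over each (background, foreground) pair
reference = softmax(logits_nhwc.reshape(n, -1, 2), axis=-1)

# naive approach: softmax over the fastest-changing dimension of NCHW, which is W
naive = softmax(logits_nchw, axis=-1)

# fix: permute with order [0, 2, 3, 1], reshape so the last dimension is 2, then softmax
fixed = softmax(logits_nchw.transpose(0, 2, 3, 1).reshape(n, -1, 2), axis=-1)

print(np.allclose(fixed, reference))
```

The naive result is still a valid probability distribution, but over the W positions of each channel rather than over the two classes of each anchor.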

Another issue is that the layout of elements in the predicted classes tensor differs between TensorFlow and the requirements of the Inference Engine Proposal layer. In TensorFlow, the tensor has the virtual layout [N, H, W, num_anchors, num_classes], while the Inference Engine Proposal layer requires the virtual layout [N, num_classes, num_anchors, H, W]. Thus, it is necessary to reshape, permute and then reshape again the output of the Softmax to the shape required by the Proposal layer:

reshape_softmax_op = Reshape(graph, {'dim': np.array([1, anchors_count, 2, -1])})
reshape_softmax_node = reshape_softmax_op.create_node([softmax_conf_node], dict(name='Reshape_Softmax_Class_'))
update_attrs(reshape_softmax_node, 'shape_attrs', 'dim')
permute_reshape_softmax_op = Permute(graph, {'order': np.array([0, 1, 3, 2])})
permute_reshape_softmax_node = permute_reshape_softmax_op.create_node([reshape_softmax_node],
dict(name='Permute_'))
# implement custom reshape infer function because we need to know the input convolution node output dimension
# sizes but we can know it only after partial infer
reshape_permute_op = Reshape(graph, {'dim': np.ones([4]), 'anchors_count': anchors_count,
'conv_node': predictions_node})
reshape_permute_op.attrs['old_infer'] = reshape_permute_op.attrs['infer']
reshape_permute_op.attrs['infer'] = __class__.classes_probabilities_reshape_shape_infer
reshape_permute_node = reshape_permute_op.create_node([permute_reshape_softmax_node],
dict(name='Reshape_Permute_Class_'))
update_attrs(reshape_permute_node, 'shape_attrs', 'dim')
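
The target of this transformation can be illustrated with NumPy. The sketch below is a simplification under stated assumptions: it works on an explicit 5D array with hypothetical sizes and uses a single transpose, so it does not reproduce the exact Reshape and Permute node sequence above, which also has to account for how Model Optimizer treats layouts. It only shows the element correspondence between the two virtual layouts.

```python
import numpy as np

# hypothetical sizes
n, h, w, num_anchors, num_classes = 1, 2, 3, 12, 2

# class probabilities in the TensorFlow virtual layout [N, H, W, num_anchors, num_classes]
probs_tf = np.arange(n * h * w * num_anchors * num_classes, dtype=np.float32)
probs_tf = probs_tf.reshape(n, h, w, num_anchors, num_classes)

# move to the virtual layout [N, num_classes, num_anchors, H, W]
probs_ie_5d = probs_tf.transpose(0, 4, 3, 1, 2)

# fold classes and anchors into one channel dimension to get a 4D tensor
probs_ie = probs_ie_5d.reshape(n, num_classes * num_anchors, h, w)

# the value for class c, anchor a at position (y, x) is found in channel c * num_anchors + a
c, a, y, x = 1, 5, 1, 2
print(probs_ie[0, c * num_anchors + a, y, x] == probs_tf[0, y, x, a, c])
```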

The Proposal layer has 3 inputs: class probabilities, box predictions and the shape of the input image. The first two tensors are ready, so it is necessary to create the Const operation that produces the desired third input tensor.

# create constant input with the image height, width and scale H and scale W (if present) required for Proposal
const_value = np.array([[input_height, input_width, 1]], dtype=np.float32)
const_op = Const(graph, dict(value=const_value, shape=const_value.shape))
const_node = const_op.create_node([], dict(name='Proposal_const_image_size_'))

Now add the Proposal layer:

proposal_op = ProposalOp(graph, dict(min_size=10, framework='tensorflow', box_coordinate_scale=10,
box_size_scale=5, post_nms_topn=max_proposals, feat_stride=feat_stride,
ratio=proposal_ratios, scale=proposal_scales, base_size=anchor_base_size,
pre_nms_topn=2**31 - 1,
nms_thresh=nms_threshold))
proposal_node = proposal_op.create_node([reshape_permute_node,
match.single_input_node(0)[0].in_node(0).in_node(0),
const_node],
dict(name=proposal_op.attrs['type'] + '_'))

The box coordinates in TensorFlow are in the "YXYX" layout, while Inference Engine uses the "XYXY" layout, so it is necessary to swap the coordinates produced by the Proposal layer. This is implemented with the help of a convolution node with a special filter of size [5, 5]:

proposal_reshape_4d_op = Reshape(graph, {'dim': np.array([max_proposals, 1, 1, 5])})
proposal_reshape_4d_node = proposal_reshape_4d_op.create_node([proposal_node], dict(name="reshape_4d_"))
update_attrs(proposal_reshape_4d_node, 'shape_attrs', 'dim')
# create convolution node to swap X and Y coordinates in the proposals
conv_filter_const_data = np.array(np.array([[1, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 1, 0, 0, 0],
[0, 0, 0, 0, 1],
[0, 0, 0, 1, 0]],
dtype=np.float32).reshape([1, 1, 5, 5]), dtype=np.float32)
conv_filter_const_op = Const(graph, dict(value=conv_filter_const_data, spatial_dims=np.array([2, 3])))
conv_filter_const_node = conv_filter_const_op.create_node([], dict(name="conv_weights"))
conv_op = Op(graph, {
'op': 'Conv2D',
'bias_addable': False,
'spatial_dims': np.array([1, 2]),
'channel_dims': np.array([3]),
'batch_dims': np.array([0]),
'pad': None,
'pad_spatial_shape': None,
'input_feature_channel': 2,
'output_feature_channel': 2,
'output_shape': [max_proposals, 1, 1, 5],
'dilation': np.array([1, 1, 1, 1], dtype=np.int64),
'stride': np.array([1, 1, 1, 1]),
'type': 'Convolution',
'group': None,
'layout': 'NHWC',
'infer': __class__.fake_conv_shape_infer})
predictions_node = conv_op.create_node([proposal_reshape_4d_node, conv_filter_const_node], dict(name="conv_"))
update_ie_fields(graph.node[predictions_node.id])
proposal_reshape_2d_op = Reshape(graph, {'dim': np.array([max_proposals, 5])})
proposal_reshape_2d_node = proposal_reshape_2d_op.create_node([predictions_node], dict(name="reshape_2d_"))
# set specific name for this Reshape operation so we can use it in the DetectionOutput replacer
proposal_reshape_2d_node['name'] = 'swapped_proposals'
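
The effect of the filter can be checked with NumPy. In the sketch below, the 1x1 convolution over 5 channels is modelled as a matrix product applied to every proposal row, which is a simplification. The row layout [batch_id, x1, y1, x2, y2] and the sample values are assumptions made for illustration.

```python
import numpy as np

# the same 5x5 filter as in the replacer code
swap_filter = np.array([[1, 0, 0, 0, 0],
                        [0, 0, 1, 0, 0],
                        [0, 1, 0, 0, 0],
                        [0, 0, 0, 0, 1],
                        [0, 0, 0, 1, 0]], dtype=np.float32)

# hypothetical proposals, one per row: [batch_id, x1, y1, x2, y2]
proposals = np.array([[0, 10, 20, 30, 40],
                      [0, 5, 6, 7, 8]], dtype=np.float32)

# a 1x1 convolution over 5 channels acts as a matrix product on each row
swapped = proposals @ swap_filter
print(swapped)
```

The first column passes through unchanged and each coordinate pair is swapped. Because the filter is a symmetric permutation matrix, applying it twice returns the original rows.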

The ROIPooling layer in TensorFlow is implemented with an operation called CropAndResize with bi-linear filtration. The Inference Engine implementation of the ROIPooling layer with bi-linear filtration requires the input box coordinates to be scaled to the [0, 1] interval. Adding an element-wise multiplication of the box coordinates solves this issue:

# the TF implementation of Proposal with bi-linear filtration need proposals scaled by image size
proposal_scale_const = np.array([1.0, 1 / input_height, 1 / input_width, 1 / input_height, 1 / input_width],
dtype=np.float32)
proposal_scale_const_op = Const(graph, dict(value=proposal_scale_const, shape=proposal_scale_const.shape))
proposal_scale_const_node = proposal_scale_const_op.create_node([], dict(name='Proposal_scale_const_'))
scale_proposals_op = Eltwise(graph, {'operation': 'mul'})
scale_proposals_node = scale_proposals_op.create_node([proposal_reshape_2d_node, proposal_scale_const_node],
dict(name='scale_proposals_'))
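
The multiplication relies on broadcasting a 5-element constant over all proposal rows. A minimal NumPy sketch, assuming a hypothetical 600x1000 input image and a single sample proposal in the [batch_id, y1, x1, y2, x2] layout:

```python
import numpy as np

# hypothetical resized image size
input_height, input_width = 600, 1000

# same constant as in the replacer code: the batch index is kept, coordinates are divided by the image size
proposal_scale_const = np.array([1.0, 1 / input_height, 1 / input_width, 1 / input_height, 1 / input_width],
                                dtype=np.float32)

# hypothetical proposal after the coordinate swap: [batch_id, y1, x1, y2, x2] in pixels
swapped_proposals = np.array([[0, 60, 100, 300, 500]], dtype=np.float32)

# element-wise multiplication, broadcast over all rows
scaled = swapped_proposals * proposal_scale_const
print(scaled)
```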

The last step is to create the ROIPooling node with 2 inputs: the identified feature maps from the FirstStageFeatureExtractor and the scaled output of the Proposal layer:

feature_extractor_output_nodes = scope_output_nodes(graph, 'FirstStageFeatureExtractor')
if len(feature_extractor_output_nodes) != 1:
    raise Error("Failed to determine FirstStageFeatureExtractor output node to connect it to the ROIPooling. "
                "Found the following nodes: {}".format([node.name for node in feature_extractor_output_nodes]))
roi_pooling_op = ROIPooling(graph, dict(method="bilinear", framework="tensorflow",
                                        pooled_h=roi_pool_size, pooled_w=roi_pool_size,
                                        spatial_scale=roi_spatial_scale))
roi_pooling_node = roi_pooling_op.create_node([feature_extractor_output_nodes[0], scale_proposals_node],
                                              dict(name='ROI_Pooling_'))
return {'roi_pooling_node': roi_pooling_node}

There are two additional methods implemented in the replacer class:

  • The fake_conv_shape_infer function is a simplified shape inference function for the convolution that permutes the X and Y coordinates of the Proposal output. It avoids setting a lot of internal attributes required for proper shape inference.
  • The classes_probabilities_reshape_shape_infer function is used to update the output dimensions of the reshape operation. The output spatial dimensions depend on the convolution output spatial dimensions, so they are not known until the shape inference pass, which is performed after this sub-graph replacement class. This custom function is therefore called instead of the default Reshape shape inference function: it updates the required attribute "dim" of the node with the convolution output spatial dimensions, which are known by the time it is called, and then calls the default Reshape inference function.

@staticmethod
def fake_conv_shape_infer(node: Node):
    node.out_node(0).shape = node.in_node(0).shape
    # call functions to update internal attributes required for correct IR generation
    mark_input_bins(node)
    assign_dims_to_weights(node.in_node(1), [0, 1], node.input_feature_channel, node.output_feature_channel, 4)

@staticmethod
def classes_probabilities_reshape_shape_infer(node: Node):
    # now we can determine the reshape dimensions from Convolution node
    conv_node = node.conv_node
    conv_output_shape = conv_node.out_node().shape
    # update desired shape of the Reshape node
    node.dim = np.array([0, conv_output_shape[1], conv_output_shape[2], node.anchors_count * 2])
    node.old_infer(node)
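
The pattern of saving the default inference function and wrapping it with a deferred one can be shown without Model Optimizer. In the sketch below, FakeNode, default_reshape_infer and all shapes are hypothetical stand-ins. In particular, the rule that 0 in "dim" copies the input dimension is modelled in a simplified way and is not the real Reshape implementation.

```python
import numpy as np


class FakeNode:
    """Hypothetical stand-in for a graph node: a plain attribute bag."""

    def __init__(self, **attrs):
        self.__dict__.update(attrs)


def default_reshape_infer(node):
    # simplified default inference: 0 in "dim" means "copy the input dimension"
    dim = np.array(node.dim, dtype=np.int64)
    in_shape = np.array(node.in_shape, dtype=np.int64)
    node.out_shape = np.where(dim == 0, in_shape[:len(dim)], dim)


def deferred_reshape_infer(node):
    # the convolution output shape is known only at shape inference time
    conv_shape = node.conv_out_shape
    node.dim = np.array([0, conv_shape[1], conv_shape[2], node.anchors_count * 2])
    # delegate to the saved default inference function
    node.old_infer(node)


# at graph construction time "dim" is only a placeholder
node = FakeNode(dim=np.ones(4), anchors_count=12, in_shape=[1, 12, 950, 2],
                infer=default_reshape_infer)
node.old_infer = node.infer
node.infer = deferred_reshape_infer

# later, during shape inference, the convolution output shape becomes available
node.conv_out_shape = [1, 25, 38, 24]
node.infer(node)
print(node.out_shape)
```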

The third replacer defined in the sub-graph replacement configuration file replaces the SecondStagePostprocessor block and is defined using scope:

{
"custom_attributes": {
"code_type": "caffe.PriorBoxParameter.CENTER_SIZE",
"confidence_threshold": 0.01,
"keep_top_k": 300,
"nms_threshold": 0.6,
"pad_mode": "caffe.ResizeParameter.CONSTANT",
"resize_mode": "caffe.ResizeParameter.WARP",
"max_detections_per_class": 100,
"num_classes": 90
},
"id": "SecondStagePostprocessorReplacement",
"inputs": [
[
{
"node": "Reshape$",
"port": 0
}
],
[
{
"node": "Reshape_1$",
"port": 0
}
],
[
{
"node": "ExpandDims$",
"port": 0
}
]
],
"instances": [
".*SecondStagePostprocessor/"
],
"match_kind": "scope",
"outputs": [
{
"node": "BatchMultiClassNonMaxSuppression/map/TensorArrayStack/TensorArrayGatherV3$",
"port": 0
}
]
}

The replacement code is similar to the SecondStagePostprocessor replacement for the SSD topologies. There are two major differences:

  • The tensor with bounding boxes doesn't contain locations for class 0 (the background class), but the Inference Engine Detection Output layer requires them. A Const node with dummy values is created and concatenated with the tensor.
  • The priors tensor is not constant like in SSDs, so the bounding boxes tensor must be scaled with variances [0.1, 0.1, 0.2, 0.2].

The differences described above are resolved with the following code:

# TF produces locations tensor without boxes for background.
# Inference Engine DetectionOutput layer requires background boxes so we generate them with some values
# and concatenate with locations tensor
fake_background_locs_blob = np.tile([[[1, 1, 2, 2]]], [max_detections_per_class, 1, 1])
fake_background_locs_const_op = Const(graph, dict(value=fake_background_locs_blob,
shape=fake_background_locs_blob.shape))
fake_background_locs_const_node = fake_background_locs_const_op.create_node([])
reshape_loc_op = Reshape(graph, {'dim': np.array([max_detections_per_class, num_classes, 4])})
reshape_loc_node = reshape_loc_op.create_node([match.single_input_node(0)[0].in_node(0)],
dict(name='Reshape_loc_'))
concat_loc_op = Concat(graph, {'axis': 1})
concat_loc_node = concat_loc_op.create_node([fake_background_locs_const_node, reshape_loc_node],
dict(name='Concat_fake_loc_'))
# blob with variances
variances_blob = np.array([0.1, 0.1, 0.2, 0.2])
variances_const_op = Const(graph, dict(value=variances_blob, shape=variances_blob.shape))
variances_const_node = variances_const_op.create_node([])
# reshape locations tensor to 2D so it could be passed to Eltwise which will be converted to ScaleShift
reshape_loc_2d_op = Reshape(graph, {'dim': np.array([-1, 4])})
reshape_loc_2d_node = reshape_loc_2d_op.create_node([concat_loc_node], dict(name='reshape_locs_2d_'))
# element-wise multiply locations with variances
eltwise_locs_op = Eltwise(graph, {'operation': 'mul'})
eltwise_locs_node = eltwise_locs_op.create_node([reshape_loc_2d_node, variances_const_node],
dict(name='scale_locs_'))
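
The shapes involved in these steps can be traced with NumPy. The sketch below uses small hypothetical values for max_detections_per_class and num_classes and a synthetic locations tensor. It mirrors the tile, concatenate, reshape and multiply sequence, not the Model Optimizer nodes themselves.

```python
import numpy as np

# small hypothetical values to keep the shapes readable
max_detections_per_class = 3
num_classes = 2

# synthetic locations tensor without background boxes
locations = np.arange(max_detections_per_class * num_classes * 4, dtype=np.float32)
locations = locations.reshape(max_detections_per_class, num_classes, 4)

# one dummy background box per detection, as in the replacer code
fake_background = np.tile([[[1, 1, 2, 2]]], [max_detections_per_class, 1, 1]).astype(np.float32)

# the background box becomes class 0 along axis 1
with_background = np.concatenate([fake_background, locations], axis=1)

# reshape to 2D and scale every box by the variances
variances = np.array([0.1, 0.1, 0.2, 0.2], dtype=np.float32)
scaled = with_background.reshape(-1, 4) * variances
print(with_background.shape, scaled.shape)
```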

Example of Model Optimizer Command-Line for TensorFlow's Faster R-CNNs

The final command line to convert Faster R-CNNs from the TensorFlow* Object Detection Zoo is the following:

./mo.py --input_model=<path_to_frozen.pb> --output=detection_boxes,detection_scores,num_detections --tensorflow_use_custom_operations_config extensions/front/tf/legacy_faster_rcnn_support.json

Note that minor changes should be made to the sub-graph replacement configuration file <INSTALL_DIR>/deployment_tools/model_optimizer/extensions/front/tf/legacy_faster_rcnn_support.json before converting a particular Faster R-CNN topology. Refer to the table below.

Sub-Graph Replacement Configuration File Parameters to Convert Different Faster R-CNN Models

Model Name                                                  Configuration File Changes
faster_rcnn_inception_v2_coco                               None
faster_rcnn_resnet50_coco                                   None
faster_rcnn_resnet50_lowproposals_coco                      None
faster_rcnn_resnet101_coco                                  None
faster_rcnn_resnet101_lowproposals_coco                     None
faster_rcnn_inception_resnet_v2_atrous_coco                 "feat_stride": 8
faster_rcnn_inception_resnet_v2_atrous_lowproposals_coco    "feat_stride": 8
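
A change such as setting feat_stride to 8 can be made by hand or scripted. The sketch below is one possible way to do it, assuming the configuration file is a JSON array of replacement descriptions shaped like the excerpts above. SAMPLE_CONFIG is a shortened in-memory stand-in for the real file, and set_custom_attribute is a hypothetical helper, not part of Model Optimizer.

```python
import copy
import json

# shortened stand-in for the content of legacy_faster_rcnn_support.json (assumed to be a JSON array)
SAMPLE_CONFIG = json.loads("""
[
    {
        "custom_attributes": {"nms_threshold": 0.7, "feat_stride": 16},
        "id": "TFObjectDetectionAPIFasterRCNNProposalAndROIPooling",
        "match_kind": "points"
    }
]
""")


def set_custom_attribute(config, replacer_id, name, value):
    """Return a copy of the configuration with one custom attribute changed."""
    patched = copy.deepcopy(config)
    for entry in patched:
        if entry.get("id") == replacer_id:
            entry["custom_attributes"][name] = value
            return patched
    raise KeyError("No replacement description with id {}".format(replacer_id))


patched = set_custom_attribute(SAMPLE_CONFIG,
                               "TFObjectDetectionAPIFasterRCNNProposalAndROIPooling",
                               "feat_stride", 8)
print(json.dumps(patched, indent=4))
```

For a real conversion, the patched structure would be written to a separate file and that file passed to the --tensorflow_use_custom_operations_config option.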