This topic demonstrates how to use the Benchmark C++ Tool to estimate deep learning inference performance on supported devices. Performance can be measured for two inference modes: synchronous (latency-oriented) and asynchronous (throughput-oriented).
NOTE: This topic describes usage of C++ implementation of the Benchmark Tool. For the Python* implementation, refer to Benchmark Python* Tool.
Upon start-up, the application reads command-line parameters and loads a network and images/binary files to the Inference Engine plugin, which is chosen depending on the specified device. The number of infer requests and the execution approach depend on the mode defined with the
-api command-line parameter.
NOTE: By default, Inference Engine samples, tools and demos expect input with BGR channels order. If you trained your model to work with RGB order, you need to manually rearrange the default channels order in the sample or demo application or reconvert your model using the Model Optimizer tool with
--reverse_input_channels argument specified. For more information about the argument, refer to the When to Reverse Input Channels section of Converting a Model Using General Conversion Parameters.
If you run the application in the synchronous mode, it creates one infer request and executes the
Infer method. If you run the application in the asynchronous mode, it creates as many infer requests as specified in the
-nireq command-line parameter and executes the
StartAsync method for each of them. If
-nireq is not set, the application will use the default value for the specified device.
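To illustrate the two modes, the sketches below show typical command lines; the model path, input path, and request count are placeholders rather than values mandated by the tool:

```sh
# Latency-oriented run: a single infer request executed with the synchronous Infer method
./benchmark_app -m <path_to_model>/model.xml -i <path_to_input> -d CPU -api sync

# Throughput-oriented run: several infer requests executed with StartAsync
# (here 4 requests; omit -nireq to let the plugin pick a device-specific default)
./benchmark_app -m <path_to_model>/model.xml -i <path_to_input> -d CPU -api async -nireq 4
```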
The number of execution steps is defined by one of the following parameters:
* Number of iterations specified with the -niter command-line argument
* Time duration specified with the -t command-line argument
* Both of them (execution will continue until both conditions are met)
* Predefined duration if -niter and -t are not specified. The predefined duration value depends on the device.
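For example, the run length can be bounded explicitly; the values below are arbitrary illustrations:

```sh
# Stop after exactly 100 iterations
./benchmark_app -m <path_to_model>/model.xml -d CPU -niter 100

# Run for roughly 60 seconds instead
./benchmark_app -m <path_to_model>/model.xml -d CPU -t 60
```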
During the execution, the application collects latency for each executed infer request.
Reported latency value is calculated as a median value of all collected latencies. Throughput is reported in frames per second (FPS) and calculated as a derivative from:
* Reported latency in the synchronous mode
* The total execution time in the asynchronous mode
Throughput value also depends on batch size.
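As a rough sketch of how these numbers relate (not necessarily the tool's exact formula), with consistent time units:

$$
\mathrm{FPS}_{\text{sync}} \approx \frac{\text{batch size}}{\text{median latency}}
\qquad
\mathrm{FPS}_{\text{async}} \approx \frac{\text{batch size} \times \text{iterations}}{\text{total execution time}}
$$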
The application also collects per-layer Performance Measurement (PM) counters for each executed infer request if you enable statistics dumping by setting the
-report_type parameter to one of the possible values:
* no_counters report includes configuration options specified, resulting FPS and latency.
* average_counters report extends the no_counters report and additionally includes average PM counters values for each layer from the network.
* detailed_counters report extends the average_counters report and additionally includes per-layer PM counters and latency for each executed infer request.
Depending on the type, the report is stored to a benchmark_no_counters_report.csv, benchmark_average_counters_report.csv, or benchmark_detailed_counters_report.csv file located in the path specified with the -report_folder command-line parameter.
The application also saves executable graph information serialized to an XML file if you specify a path to it with the -exec_graph_path command-line parameter.
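A combined run that dumps both artifacts might look like the following sketch; the report folder and graph file name are arbitrary placeholders:

```sh
# Dump averaged per-layer counters to CSV and the executable graph to XML
./benchmark_app -m <path_to_model>/model.xml -i <path_to_input> -d CPU \
    -report_type average_counters -report_folder ./reports \
    -exec_graph_path exec_graph.xml
```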
Note that the benchmark_app usually produces optimal performance for any device out of the box.
So in most cases, you don't need to tune the application options explicitly, and the plain device name is enough, for example, for CPU.
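A minimal invocation along these lines might look as follows; the model and input paths are placeholders you would substitute with your own:

```sh
./benchmark_app -m <path_to_model>/model.xml -i <path_to_input> -d CPU
```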
However, it still may be non-optimal in some cases, especially for very small networks. More details can be found in Introduction to Performance Topics.
As explained in the Introduction to Performance Topics section, for all devices, including the new MULTI device, it is preferable to use the FP16 IR for the model. Also, if the latency of CPU inference on multi-socket machines is of concern, refer to the same Introduction to Performance Topics document.
Running the application with the
-h option yields the following usage message:
Running the application with an empty list of options yields the usage message given above and an error message.
The application supports topologies with one or more inputs. If a topology is not data-sensitive, you can skip the input parameter, and the inputs will be filled with random values. If a model has only image input(s), provide a folder with images or a path to an image as input. If a model has some specific input(s) (not images), prepare binary file(s) filled with data of the appropriate precision and provide a path to them as input. If a model has mixed input types, the input folder should contain all required files. Image inputs are filled with image files one by one. Binary inputs are filled with binary files one by one.
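For instance, assuming an image-only model, either of the following sketches is a valid way to feed inputs; the folder path is a placeholder:

```sh
# No -i given: inputs are filled with random values
./benchmark_app -m <path_to_model>/model.xml -d CPU

# Image input: point -i to a single image or to a folder of images
./benchmark_app -m <path_to_model>/model.xml -i <path_to_images_folder> -d CPU
```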
NOTE: Before running the tool with a trained model, make sure the model is converted to the Inference Engine format (*.xml + *.bin) using the Model Optimizer tool.
The tool accepts models in ONNX format (.onnx) that do not require preprocessing.
This section provides step-by-step instructions on how to run the Benchmark Tool with the
googlenet-v1 public model on CPU or FPGA devices. As an input, the
car.png file from the
<INSTALL_DIR>/deployment_tools/demo/ directory is used.
NOTE: Internet access is required to execute the following steps successfully. If you have access to the Internet only through a proxy server, make sure that it is configured in your OS environment.
1. Download the model. Run the Model Downloader downloader.py script, specifying the model name and the directory to download the model to:
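   Assuming the Model Downloader from the Open Model Zoo is available, the command might look like this; the output directory is a placeholder:

   ```sh
   python3 downloader.py --name googlenet-v1 -o <models_dir>
   ```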
2. Convert the model to the Inference Engine IR format. Run the Model Optimizer mo.py script, specifying the path to the model, the model format (which must be FP32 for CPU and FPGA) and the output directory to generate the IR files:
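   A possible Model Optimizer invocation is sketched below; the exact model path depends on where the downloader placed the files, and the output directory is a placeholder:

   ```sh
   python3 mo.py --input_model <models_dir>/public/googlenet-v1/googlenet-v1.caffemodel \
       --data_type FP32 --output_dir <ir_dir>
   ```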
3. Run the tool, specifying the <INSTALL_DIR>/deployment_tools/demo/car.png file as an input image, the IR of the googlenet-v1 model, and a device to perform inference on. The following commands demonstrate running the Benchmark Tool in the asynchronous mode on CPU and FPGA devices:
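   For example, assuming the IR was written to <ir_dir>, the runs might look like the sketches below; on FPGA the device is typically specified through the HETERO plugin with a CPU fallback:

   ```sh
   # CPU, asynchronous mode
   ./benchmark_app -m <ir_dir>/googlenet-v1.xml -i <INSTALL_DIR>/deployment_tools/demo/car.png -d CPU -api async

   # FPGA with CPU fallback, asynchronous mode
   ./benchmark_app -m <ir_dir>/googlenet-v1.xml -i <INSTALL_DIR>/deployment_tools/demo/car.png -d HETERO:FPGA,CPU -api async
   ```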
The application outputs the number of executed iterations, total duration of execution, latency, and throughput. Additionally, if you set the -report_type parameter, the application outputs a statistics report. If you set the -pc parameter, the application outputs performance counters. If you set the -exec_graph_path parameter, the application saves the executable graph information serialized to an XML file. All measurements, including per-layer PM counters, are reported in milliseconds.
Below are fragments of sample output for CPU and FPGA devices: