Deploy and Integrate Performance Criteria into Application

Once you identify the optimal configuration of inference requests, batch size, and target device for a model, you can incorporate those settings into the inference engine deployed with your application.

How Streams and Batches Impact Performance

Internally, the execution resources are split/pinned into execution streams. This feature provides much better performance for networks that do not scale well across multiple threads (for example, lightweight topologies). The effect is especially pronounced on many-core server machines. Refer to the Throughput Mode for CPU section in the Optimization Guide for more information.

NOTE: Unlike CPUs and GPUs, VPUs do not support streams. Therefore, on a VPU you can find only the optimal number of inference requests. For details, refer to the Performance Aspects of Running Multiple Requests Simultaneously section in the Optimization Guide.

During model execution, streams, as well as the inference requests within each stream, can be distributed inefficiently among hardware cores, which reduces inference speed. The DL Workbench Inference Results provide the information you need to manually redistribute streams, and the inference requests in each stream, across hardware cores, helping you optimize performance of your model on specific hardware.

NOTE: Inference requests in each stream are parallel.

The optimal configuration is the one with the highest throughput value. Latency, or the execution time of a single inference, is critical for real-time services. The common technique for improving throughput is batching. However, real-time applications often cannot take advantage of batching, because a high batch size comes with a latency penalty. With the 2018 R5 release, OpenVINO™ introduced a throughput mode that allows the Inference Engine to efficiently run multiple inference requests simultaneously, greatly improving throughput.

Discover Optimal Combination of Streams and Batches with DL Workbench

To find an optimal combination of inference requests and batches, follow the steps described in Run Range of Inferences.

The optimal combination is the highest point on the Inference Results graph. However, you can choose to limit latency values by specifying the Latency Threshold value and select an optimal inference among those within the limit.

To view information about latency, throughput, batch, and parallel requests of a specific job, hover your cursor over the corresponding point on the graph:

For details, read Integrate the Inference Engine New Request API with Your Application.

Integrate Optimal Combination into Sample Application

Below is an example of how to build your own application that uses the optimal batch and stream numbers found with the DL Workbench. Follow these steps:

  1. Download a deployment package as described in the Download Deployment Package section of Build Your Application with Deployment Package.
  2. Create main.cpp.
  3. Create CMakeLists.txt.
  4. Compile the application.
  5. Run the application with the optimal performance criteria.

NOTE: The machine where you use the DL Workbench to download the package and where you prepare your own application is a developer machine. The machine where you deploy the application is a target machine.

Create main.cpp

NOTE: Perform this step on your developer machine.

Create a file main.cpp and paste there the code provided below:


#include <inference_engine.hpp>

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using namespace InferenceEngine;

int main(int argc, char *argv[]) {
    if (argc < 3) {
        std::cerr << "Usage: " << argv[0] << " PATH_TO_MODEL_XML DEVICE [BATCH_SIZE NUM_INFER_REQUESTS]" << std::endl;
        return 1;
    }
    int batchSize = 1;
    int numInferReq = 1;
    if (argc == 5) {
        batchSize = std::stoi(argv[3]);
        numInferReq = std::stoi(argv[4]);
    }
    const std::string modelXml = argv[1];
    std::string device = argv[2];
    std::transform(device.begin(), device.end(), device.begin(), ::toupper);
    Core ie;
    // Start setting number of streams
    int numStreams = numInferReq;
    if (device == "CPU") {
        ie.SetConfig({{CONFIG_KEY(CPU_THROUGHPUT_STREAMS), std::to_string(numStreams)}}, device);
    }
    if (device == "GPU") {
        numStreams = numInferReq / 2;
        if (numInferReq % 2) {
            numStreams++;
        }
        ie.SetConfig({{CONFIG_KEY(GPU_THROUGHPUT_STREAMS), std::to_string(numStreams)}}, device);
    }
    // Finish setting number of streams
    CNNNetwork network = ie.ReadNetwork(modelXml);
    // Set batch
    network.setBatchSize(batchSize);
    ExecutableNetwork executableNetwork = ie.LoadNetwork(network, device);
    std::vector<InferRequest> requests(numInferReq);
    for (int i = 0; i < numInferReq; i++) {
        // Create an InferRequest
        requests[i] = executableNetwork.CreateInferRequest();
        // Run the InferRequest asynchronously
        requests[i].StartAsync();
    }
    for (int i = 0; i < numInferReq; i++) {
        StatusCode status = requests[i].Wait(IInferRequest::WaitMode::RESULT_READY);
        if (status != StatusCode::OK) {
            std::cout << "inferRequest " << i << " failed" << std::endl;
            return 1;
        }
    }
    std::cout << "Inference completed successfully" << std::endl;
    return 0;
}

In the code above, the following section sets the number of streams for CPU and GPU devices. On a CPU, the number of streams equals the number of inference requests. On a GPU, it is half the number of inference requests when that number is even, and half plus one when it is odd:

int numStreams = numInferReq;
if (device == "CPU") {
    ie.SetConfig({{CONFIG_KEY(CPU_THROUGHPUT_STREAMS), std::to_string(numStreams)}}, device);
}
if (device == "GPU") {
    numStreams = numInferReq / 2;
    if (numInferReq % 2) {
        numStreams++;
    }
    ie.SetConfig({{CONFIG_KEY(GPU_THROUGHPUT_STREAMS), std::to_string(numStreams)}}, device);
}

The batch size is set with the following line:

network.setBatchSize(batchSize);
Inference requests are created and started in the section below:

for (int i = 0; i < numInferReq; i++) {
    // Create an InferRequest
    requests[i] = executableNetwork.CreateInferRequest();
    // Run the InferRequest asynchronously
    requests[i].StartAsync();
}
Create CMakeLists.txt

NOTE: Perform this step on your developer machine.

In the same directory as main.cpp, create a file named CMakeLists.txt with the following commands to compile main.cpp into an executable file:


cmake_minimum_required(VERSION 3.10)
project(ie_sample)
set(CMAKE_CXX_STANDARD 14)
set(IE_SAMPLE_NAME ie_sample)
find_package(InferenceEngine 2.1 REQUIRED)
add_executable(${IE_SAMPLE_NAME} main.cpp)
target_link_libraries(${IE_SAMPLE_NAME} PUBLIC ${InferenceEngine_LIBRARIES})

Compile Application

NOTE: Perform this step on your developer machine.

Open a terminal in the directory with main.cpp and CMakeLists.txt, and run the following commands to build the sample:

NOTE: Replace <INSTALL_OPENVINO_DIR> with the directory you installed the OpenVINO™ package in. By default, the package is installed to /opt/intel/openvino or ~/intel/openvino.

source <INSTALL_OPENVINO_DIR>/bin/setupvars.sh
mkdir build
cd build
cmake ../
make

Once the commands are executed, find the ie_sample binary in the build folder in the directory with the source files.

Run Application

Step 1. Make sure you have the following components on your developer machine:

  • Deployment package
  • Model (if it is not included in the package)
  • Binary file with your application, ie_sample for example

Step 2. Unarchive the deployment package. Place the binary and model inside the deployment_package folder as follows:

|-- deployment_package
    |-- bin
    |-- deployment_tools
    |-- install_dependencies
    |-- model
        |-- model.xml
        |-- model.bin
    |-- ie_sample

Step 3. Archive the deployment_package folder and copy it to the target machine.

NOTE: Perform the steps below on your target machine.

Step 4. Open a terminal in the deployment_package folder on the target machine.

Step 5. Optional: for inference on Intel® GPU, Intel® Movidius™ VPU, or Intel® Vision Accelerator Design with Intel® Movidius™ VPUs targets, install dependencies by running the script:

sudo -E ./install_dependencies/install_openvino_dependencies.sh

Step 6. Set up the environment variables by running the bin/setupvars.sh script:

source ./bin/setupvars.sh

Step 7. Run the application, specifying the path to your model, the target device, the batch size, and the number of inference requests. In our example, a batch size of 4 and 2 parallel inference requests is the optimal combination for the squeezenet1.0 model on a GPU device, so we pass these values to the command:

./ie_sample <path>/<model>.xml GPU 4 2


  • The order of the arguments is important.
  • Replace <path> and <model> with the path to your model and its name.
  • In the command above, the application runs on a GPU device. See the Supported Inference Devices section of Install DL Workbench for code names of other devices.

Step 8. Once you run the application, you get the following output:

Inference completed successfully

See Also

  • Run Range of Inferences
  • Build Your Application with Deployment Package
  • Integrate the Inference Engine New Request API with Your Application
  • Optimization Guide