Once you identify the optimal configuration of inferences, batch and target for a model, you can incorporate those settings into the inference engine deployed with your application.
Internally, the execution resources are split/pinned into execution streams. This feature provides much better performance for networks that are not originally scaled with a number of threads (for example, lightweight topologies). This is especially pronounced for the many-core server machines. Refer to the Throughput Mode for CPU section in the Optimization Guide for more information.
NOTE: Unlike CPUs and GPUs, VPUs do not support streams. Therefore, on a VPU you can find only optimal inference requests combination. For details, refer to the Performance Aspects of Running Multiple Requests Simultaneously section in the Optimization Guide.
During execution of a model, streams, as well as inference requests in a stream, can be distributed inefficiently among cores of hardware, which can reduce model speed. Using the DL Workbench Inference Results can help optimize performance of your model on specific hardware by providing you with the information you need to manually redistribute streams and inference requests in each stream for each core of the hardware.
NOTE: Inference requests in each stream are parallel.
The optimal configuration is the one with the highest throughput value. Latency, or execution time of an inference, is critical for real-time services. The common technique for improving performance is batching. However, real-time applications often cannot take advantage of batching, because high batch size comes with the latency penalty. With the 2018 R5 release, OpenVINO™ introduced a throughput mode which allows the Inference Engine to efficiently run multiple inference requests simultaneously, greatly improving the throughput.
To find an optimal combination of inference requests and batches, follow the steps described in Run Range of Inferences.
The optimal combination is the highest point on the Inference Results graph. However, you might choose to limit latency values by specifying the Max. Latency value and select an optimal inference among the limited number of inferences:
To view information about latency, throughput, batch, and parallel requests of a specific job, hover your cursor over the corresponding point on the graph:
For details, read Integrate the Inference Engine New Request API with Your Application.