The purpose of this document is to give you performance-related insights to every step of the network deployment process.
Deep Learning Inference Engine is a part of Intel® Deep Learning Deployment Toolkit (Intel® DL Deployment Toolkit) and OpenVINO™ toolkit. Inference Engine facilitates deployment of deep learning solutions by delivering a unified, device-agnostic API.
Below, there are the three main steps of the deployment process:
Performance data comes in a variety of forms. For example, one of the the most common performance metrics is latency, which represents the time required to complete a unit of work (for instance, inference time for a single image). In the following sections, you will see important recommendations for measuring the performance.
When evaluating performance of your model with the Inference Engine, you must measure the proper set of operations. To do so, consider the following tips:
NOTE: Some image pre-processing can be baked into the IR and accelerated. For more information, refer to Model Optimizer Knobs Related to Performance.
In the asynchronous case (see Request-Based API and “GetBlob” Idiom), the performance of an individual infer request is usually of less concern. Instead, you typically execute multiple requests asynchronously and measure the throughput in images per second by dividing the number of images that were processed by the processing time. In contrast, for the latency-oriented tasks, the time to a single frame is more important.
Refer to the Benchmark App sample, which allows latency vs. throughput measuring.
NOTE: Most samples also support batching (automatically packing multiple input images into a single request). However, high batch size results in a latency penalty. So for more real-time oriented usages, lower batch sizes (as low as a single input) are usually used. However, devices like CPU, Intel® Movidius™ Myriad™ 2 VPU, Intel® Movidius™ Myriad™ X VPU, or Intel® Vision Accelerator Design with Intel® Movidius™ VPU require a number of parallel requests instead of batching to leverage the performance.
When comparing the Inference Engine performance with the framework or another reference code, make sure that both versions are as similar as possible:
FP16support, so when comparing to that, make sure to test the Inference Engine with the
You need to build your performance conclusions on reproducible data. Do the performance measurements with a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, you can use an aggregated value for the execution time for final projections:
Refer to the Inference Engine Samples for code examples for the performance measurements. Almost every sample, except interactive demos, has a
-ni option to specify the number of iterations.
Networks training is typically done on high-end data centers, using popular training frameworks like Caffe*, TensorFlow*, and MXNet*. Model Optimizer converts the trained model in original proprietary formats to IR that describes the topology. IR is accompanied by a binary file with weights. These files in turn are consumed by the Inference Engine and used for scoring.
As described in the Model Optimizer Guide, there are a number of device-agnostic optimizations the tool performs. For example, certain primitives like linear operations (BatchNorm and ScaleShift), are automatically fused into convolutions. Generally, these layers should not be manifested in the resulting IR:
The picture above shows Caffe* Resnet269* topology. The left model is the original model, and the one on the right (after conversion) is the resulting model that the Model Optimizer produces, with BatchNorm and ScaleShift layers fused into the convolution weights rather than constituting separate layers.
If you still see these operations, inspect the Model Optimizer output carefully while searching for warnings, such as on the tool being unable to fuse. For example, non-linear operations (like activations) in between convolutions and linear operations might prevent the fusing. If performance is of concern, try to change (and potentially re-train) the topology. Refer to the Model Optimizer Guide for more optimizations.
Notice that the activation (
_relu) is not touched by the Model Optimizer, and while it can be merged into convolution as well, this is rather a device-specific optimization, covered by Inference Engine during the model loading time. You are encouraged to inspect performance counters from plugins that should indicate that these particular layers are not executed (“Optimized out”). For more information, refer to Internal Inference Performance Counters.
–mean_values) with the Model Optimizer when you need pre-processing. It allows the tool to bake the pre-processing into the IR to get accelerated by the Inference Engine.
--reverse_input_channelscommand line option, so you do not need to convert your inputs to RGB every time you get the BGR image, for example, from OpenCV*.
FP32, directly affects performance. As CPU now supports
FP16(while internally upscaling to
FP32anyway) and because this is the best precision for a GPU target, you may want to always convert models to
FP16. Notice that this is the only precision that Intel® Movidius™ Myriad™ 2 and Intel® Myriad™ X VPUs support.
The Inference Engine supports several target devices (CPU, GPU, Intel® Movidius™ Myriad™ 2 VPU, Intel® Movidius™ Myriad™ X VPU, Intel® Vision Accelerator Design with Intel® Movidius™ Vision Processing Units (VPU) and FPGA), and each of them has a corresponding plugin. If you want to optimize a specific device, you must keep in mind the following tips to increase the performance.
CPU plugin completely relies on the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) for major primitives acceleration, for example, Convolutions or FullyConnected.
The only hint you can get from that is how the major primitives are accelerated (and you cannot change this). For example, on the Core machines, you should see variations of the
jit_avx2 when inspecting the internal inference performance counters (and additional '_int8' postfix for int8 inference). If you are an advanced user, you can further trace the CPU execution with (see Intel® VTune™).
Internally, the Inference Engine has a threading abstraction level, which allows for compiling the open source version with either Intel® Threading Building Blocks (Intel® TBB) which is now default, or OpenMP* as an alternative parallelism solution. When using inference on the CPU, this is particularly important to align threading model with the rest of your application (and any third-party libraries that you use) to avoid oversubscription. For more information, see Note on the App-Level Threading section.
Since R1 2019, the OpenVINO™ toolkit comes pre-compiled with Intel TBB, so any OpenMP* API or environment settings (like
OMP_NUM_THREADS) has no effect anymore. Certain tweaks (like number of threads used for inference on the CPU) are still possible via CPU configuration options. Finally, the OpenVINO CPU inference is NUMA-aware, please refer to the Tips for inference on NUMA systems section.
Other general recommendations:
Unlike most accelerators, CPU is perceived as an inherently latency-oriented device. In fact, the OpenVINO does support the "throughput" mode for the CPU, which allows the Inference Engine to efficiently run multiple inference requests on the CPU simultaneously, greatly improving the overall throughput.
Internally, the execution resources are split/pinned into execution "streams". This feature usually provides much better performance for the networks than batching. This is especially true for the many-core server machines:
Try the Benchmark App sample and play with number of streams running in parallel. The rule of thumb is tying up to a number of CPU cores on your machine. For example, on an 8-core CPU, compare the
-nstreams 1 (which is a legacy, latency-oriented scenario) to the 2, 4, and 8 streams.
In addition, you can play with the batch size to find the throughput sweet spot.
If your application is hard or impossible to change in accordance with the multiple-requests logic, consider the "multiple-instance" trick to improve the throughput:
#physcores and further, while trying to saturate the machine with running multiple instances of the application.
Inference Engine relies on the Compute Library for Deep Neural Networks (clDNN) for Convolutional Neural Networks acceleration on Intel® GPUs. Internally, clDNN uses OpenCL™ to implement the kernels. Thus, many general tips apply:
FP32, as the Model Optimizer can generate both variants and the
NOTE: See the Benchmark App Sample code for a usage example.
Notice that while disabling the polling, this option might reduce the GPU performance, so usually this option is used with multiple GPU streams.
Since Intel® Movidius™ Myriad™ X Visual Processing Unit (Intel® Movidius™ Myriad™ 2 VPU) communicates with the host over USB, minimum four infer requests in flight are recommended to hide the data transfer costs. See Request-Based API and “GetBlob” Idiom and Benchmark App Sample for more information.
Intel® Vision Accelerator Design with Intel® Movidius™ VPUs requires to keep at least 32 inference requests in flight to fully saturate the device.
Below are listed the most important tips for the efficient usage of the FPGA:
-nior 'niter' option to do that).
Heterogeneous execution (constituted by the dedicated Inference Engine “Hetero” plugin) enables to schedule a network inference to the multiple devices.
The primary points for executing a network in heterogeneous mode are as follows:
The execution through heterogeneous plugin has three distinct steps:
Performance benefits of the heterogeneous execution depend heavily on the communications granularity between devices. If transmitting/converting data from one part device to another takes more time than the execution, the heterogeneous approach makes little or no sense. Using Intel® VTune™ helps to visualize the execution flow on a timeline (see Intel® VTune™ Examples).
Similarly, if there are too much subgraphs, the synchronization and data transfers might eat the entire performance. In some cases, you can define the (coarser) affinity manually to avoid sending data back and forth many times during one inference.
The general affinity “rule of thumb” is to keep computationally-intensive kernels on the accelerator, and "glue" or helper kernels on the CPU. Notice that this includes the granularity considerations. For example, running some custom activation (that comes after every accelerator-equipped convolution) on the CPU might result in performance degradation due to too much data type and/or layout conversions, even though the activation itself can be extremely fast. In this case, it might make sense to consider implementing the kernel for the accelerator (see Optimizing Custom Kernels). The conversions typically manifest themselves as outstanding (comparing to CPU-only execution) 'Reorder' entries (see Internal Inference Performance Counters).
For general details on the heterogeneous plugin, refer to the corresponding section in the Inference Engine Developer Guide.
Every Inference Engine sample supports the
-d (device) option.
For example, here is a command to run an Object Detection Sample SSD Sample:
HETEROstands for Heterogeneous plugin.
FPGA,CPUpoints to fallback policy with first priority on FPGA and further fallback to CPU.
You can point more than two devices:
As FPGA is considered as an inference accelerator, most performance issues are related to the fact that due to the fallback, the CPU can be still used quite heavily.
SoftMaxin most classification models or
DetectionOutputin the SSD*-based topologies). In that case, limiting the number of CPU threads with `KEY_CPU_THREADS_NUM` config would further reduce the CPU utilization without significantly degrading the overall performance.
KMP_BLOCKTIMEenvironment variable to something less than default 200ms (we suggest 1ms) is particularly helpful. Use
KMP_BLOCKTIME=0if the CPU subgraph is small.
NOTE: General threading tips (see Note on the App-Level Threading) apply well, even when the entire topology fits the FPGA, because there is still a host-side code for data pre- and post-processing.
The following tips are provided to give general guidance on optimizing execution on GPU/CPU devices.
-d GPU). If there are specific kernels that are not supported by the GPU, the best option to try is the
HETERO:GPU,CPUthat automatically applies default splitting (based on the plugins layers support). Then, you can play with the manual affinity settings (for example, to further minimize the number of subgraphs).
FP32(CPU) execution results in conversions and, thus, performance issues. If you are seeing a lot of heavy outstanding (compared to the CPU-only execution) Reorders, consider implementing actual GPU kernels. Refer to Internal Inference Performance Counters for more information.
There is a dedicated configuration option that enables dumping the visualization of the subgraphs created by the heterogeneous plugin, please see code example in the HETERO plugin documentation
After enabling the configuration key, the heterogeneous plugin generates two files:
hetero_affinity.dot- per-layer affinities. This file is generated only if default fallback policy was executed (as otherwise you have set the affinities by yourself, so you know them).
hetero_subgraphs.dot- affinities per sub-graph. This file is written to the disk during execution of
Core::LoadNetworkfor the heterogeneous flow.
You can use GraphViz* utility or
.dot converters (for example, to
sudo apt-get install xdot. Below is an example of the output trimmed to the two last layers (one executed on the FPGA and another on the CPU):
You can also use performance data (in the Benchmark App, it is an option
-pc) to get performance data on each subgraph. Again, refer to the HETERO plugin documentation and to Internal Inference Performance Counters for a general counters information.
The Inference Engine supports CPU, GPU and VPU custom kernels. Typically, custom kernels are used to quickly implement missing layers for new topologies. You should not override standard layers implementation, especially on the critical path, for example, Convolutions. Also, overriding existing layers can disable some existing performance optimizations, such as fusing.
It is usually easier to start with the CPU extension and switch to the GPU after debugging with the CPU path. Sometimes, when the custom layers are at the very end of your pipeline, it is easier to implement them as regular post-processing in your application without wrapping them as kernels. This is particularly true for the kernels that do not fit the GPU well, for example, output bounding boxes sorting. In many cases, you can do such post-processing on the CPU.
There are many cases when sequence of the custom kernels can be implemented as a "super" kernel allowing to save on data accesses.
Finally, with the heterogeneous execution, it is possible to execute the vast majority of intensive computations with the accelerator and keep the custom pieces on the CPU. The tradeoff is granularity/costs of communication between different devices.
For more details on custom layers in Inference Engine, see Inference Engine Extensibility Mechanism
In most cases, before actually implementing a full-blown code for the kernel, you can estimate the final performance by doing a simple stub kernel that does nothing (and thus is "infinitely" fast) just to let the topology execute end-to-end. Of course, the estimation is valid only if the kernel output does not affect the performance, for instance, if its output is not driving any branches or loops.
Other than that, when implementing the kernels, you can try the methods from the previous chapter to understand actual contribution and, if any custom kernel is in the hotspots, optimize that.
For inference on the CPU there are multiple threads binding options, see CPU configuration options.
If you are building an app-level pipeline with third-party components like GStreamer*, the general guidance for NUMA machines is as follows:
numactlcommand with proper settings before actual GStreamer commands).
LD_PRELOADon Linux* OS.
In many cases, a network expects a pre-processed image, so make sure you do not perform unnecessary steps in your code:
FP32on your side, as this is something that plugins can accelerate. Use the
InferenceEngine::Precision::U8as your input format:
Note that in many cases, you can directly share the (input) data with the Inference Engine.
The general approach for sharing data between Inference Engine and media/graphics APIs like Intel® Media Server Studio (Intel® MSS) is based on sharing the system memory. That is, in your code, you should map or copy the data from the API to the CPU address space first.
For Intel MSS, it is recommended to perform a viable pre-processing, for example, crop/resize, and then convert to RGB again with the Video Processing Procedures (VPP). Then lock the result and create an Inference Engine blob on top of that. The resulting pointer can be used for the
InferenceEngine::NHWC layout is not supported natively by most InferenceEngine plugins so internal conversion might happen.
Alternatively, you can use RGBP (planar RGB) output from Intel MSS. This allows to wrap the (locked) result as regular NCHW which is generally friendly for most plugins (unlike NHWC). Then you can use it with
SetBlob just like in previous example:
The only downside of this approach is that VPP conversion to RGBP is not hardware accelerated (and performed on the GPU EUs). Also, it is available only on LInux.
Unlike APIs that use dedicated address space and/or special data layouts (for instance, compressed OpenGL* textures), regular OpenCV data objects like
cv::Mat reside in the conventional system memory. That is, the memory can be actually shared with the Inference Engine and only data ownership to be transferred.
Again, if the OpenCV and Inference Engine layouts match, the data can be wrapped as Inference Engine (input/output) blob. Notice that by default, Inference Engine accepts the planar and not interleaved inputs in NCHW, so the NHWC (which is exactly the interleaved layout) should be specified explicitly:
InferenceEngine::NHWC layout is not supported natively by most InferenceEngine plugins so internal conversion might happen.
Notice that original
cv::Mat/blobs cannot be used simultaneously by the application and the Inference Engine. Alternatively, the data that the pointer references to can be copied to unlock the original data and return ownership to the original API.
Infer Request based API offers two types of request: Sync and Async. The Sync is considered below. The Async splits (synchronous)
Wait (see Inference Engine Async API).
More importantly, an infer request encapsulates the reference to the “executable” network and actual inputs/outputs. Now, when you load the network to the plugin, you get a reference to the executable network (you may consider that as a queue). Actual infer requests are created by the executable network:
GetBlob is a recommend way to communicate with the network, as it internally allocates the data with right padding/alignment for the device. For example, the GPU inputs/outputs blobs are mapped to the host (which is fast) if the
GetBlob is used. But if you called the
SetBlob, the copy (from/to the blob you have set) into the internal GPU plugin structures will happen.
If your application simultaneously executes multiple infer requests:
EXCLUSIVE_ASYNC_REQUESTSconfiguration option that limits the number of the simultaneously executed requests for all (executable) networks that share the specific device to just one:
<br>For more information on the executable networks notation, see <a href="#new-request-based-api">Request-Based API and “GetBlob” Idiom</a>. - The heterogeneous device uses the `EXCLUSIVE_ASYNC_REQUESTS` by default. - `KEY_EXCLUSIVE_ASYNC_REQUESTS` option affects only device queues of the individual application.
In the Inference Engine, there is no notion of requests priorities. It is left to the user side (for example, not queuing the low priority infer request, until another higher priority is waiting). Notice that it would require additional logic to synchronize between executable networks (queues) in your application code.
Inference Engine Async API can improve overall frame rate of the application. While accelerator is busy with the inference, the application can continue doing things on the host rather than wait for the inference to complete.
In the example below, inference is applied to the results of the video decoding. So it is possible to keep two parallel infer requests, and while the current is processed, the input frame for the next is being captured. This essentially hides the latency of capturing, so that the overall frame rate is rather determined only by the slowest part of the pipeline (decoding IR inference) and not by the sum of the stages.
You can compare the pseudo-codes for the regular and async-based approaches:
NEXTrequest is populated in the main (application) thread, while the
CURRENTrequest is processed:
The technique can be generalized to any available parallel slack. For example, you can do inference and simultaneously encode the resulting or previous frames or run further inference, like emotion detection on top of the face detection results.
There are important performance caveats though: for example, the tasks that run in parallel should try to avoid oversubscribing the shared compute resources. If the inference is performed on the FPGA and the CPU is essentially idle, it makes sense to do things on the CPU in parallel. However, multiple infer requests can oversubscribe that. Notice that heterogeneous execution can implicitly use the CPU, refer to Heterogeneity.
Also, if the inference is performed on the graphics processing unit (GPU), it can take little gain to do the encoding, for instance, of the resulting video, on the same GPU in parallel, because the device is already busy.
Refer to the Object Detection SSD Demo (latency-oriented Async API showcase) and Benchmark App Sample (which has both latency and throughput-oriented modes) for complete examples of the Async API in action.
Whether you are tuning for the first time or doing advanced performance optimization, you need a a tool that provides accurate insights. Intel® VTune™ Amplifier gives you the tool to mine it and interpret the profiling data.
Alternatively, you can gather the raw profiling data that samples report, the second chapter provides example of how to interpret these.
All major performance calls of the Inference Engine are instrumented with Instrumentation and Tracing Technology APIs. This allows viewing the Inference Engine calls on the Intel® VTune™ timelines and aggregations plus correlating them to the underlying APIs, like OpenCL. In turn, this enables careful per-layer execution breakdown.
When choosing the Analysis type in Intel® VTune™ Amplifier, make sure to select the Analyze user tasks, events, and counters option:
See the corresponding section in the Intel® VTune™ Amplifier User's Guide for details.
Example of Inference Engine calls:
On the Intel VTune Amplifier timeline. Notice that
Task_runNOThrow is an Async API wrapper and it is executed in a different thread and triggers the Intel MKL-DNN execution:
In the Intel VTune Amplifier Top-down view, grouped by the Task Domain. Notice the
MKLDNN _INFER that are bracketing the actual Intel MKL-DNN kernels execution:
Similarly, you can use any GPU analysis in the Intel VTune Amplifier and get general correlation with Inference Engine API as well as the execution breakdown for OpenCL kernels.
Just like with regular native application, further drill down in the counters is possible, however, this is mostly useful for optimizing custom kernels. Finally, with the Intel VTune Amplifier, the profiling is not limited to your user-level code (see the corresponding section in the Intel® VTune™ Amplifier User's Guide).
Almost every sample (inspect command-line options for a specific sample with
-h) supports a
-pc command that outputs internal execution breakdown. Refer to the samples code for the actual Inference Engine API behind that.
Below is example of CPU plugin output for a network (since the device is CPU, the layers wall clock
realTime and the
cpu time are the same):
This contains layers name (as seen in IR), layers type and execution statistics. Notice the
OPTIMIZED_OUT, which indicates that the particular activation was fused into adjacent convolution. Also, the
unknown stays for the Inference Engine specific CPU (helper) primitives that are not part of the Intel MKL-DNN.
Notice that there are some helper layers in the CPU execution breakdown, which were not presented in the original topology. These are automatically added by the plugin. For example, the
Reorder re-packs the Intel MKL-DNN internal (blocked) layout to the regular plain NCHW (that the user expects as the output). As explained in the Few Device-Specific Tips, if your custom kernels introduces a lot of outstanding/expensive Reorders, consider blocked implementation for the kernels.
Notice that in the heterogeneous cases, there will be additional information on which subgraph the statistics is about (the first subgraph is GPU, so its
cpu/host time is really small compared to the actual
As mentioned earlier,
unknown here means CPU kernel with unknown (for example, not AVX2 or AVX512) acceleration path. Since FPGA execution does not separate individual kernels, only bulk execution/data transfer statistics is available:
softmax/copy is a glue layer that connects the FPGA subgraph to the CPU subgraph (and copies the data).