The purpose of this document is to give you performance-related insights into every step of the network deployment process.
For information on the general workflow, refer to the documentation in See Also. For an example Inference Engine API snippet, see Request-Based API and “GetBlob” Idiom.
Deep Learning Inference Engine is a part of Intel® Deep Learning Deployment Toolkit (Intel® DL Deployment Toolkit) and OpenVINO™ toolkit. Inference Engine facilitates deployment of deep learning solutions by delivering a unified, device-agnostic API.
The deployment process involves three main steps: converting a trained model to the IR with the Model Optimizer, executing the IR with the Inference Engine, and integrating the inference into your application.
Performance data comes in a variety of forms. For example, one of the most common performance metrics is latency, which represents the time required to complete a unit of work (for instance, the inference time for a single image). In the following sections, you will find important recommendations for measuring performance.
When evaluating performance of your model with the Inference Engine, you must measure the proper set of operations. To do so, consider the following tips:
NOTE: Some image pre-processing can be baked into the IR and accelerated. For more information, refer to Model Optimizer Knobs Related to Performance.
In the asynchronous case (see Request-Based API and “GetBlob” Idiom), no direct way exists to measure the performance of an individual infer request. Instead, you typically execute multiple requests asynchronously and measure the throughput in images per second by dividing the number of images that were processed by the processing time. For example, see Image Classification Sample Async or other Async Inference Engine demos.
In contrast, for latency-oriented tasks, the time to process a single frame is more important. This is how the Image Classification Sample and other getting-started examples measure performance.
NOTE: Most samples also support batching (automatically packing multiple input images into a single request). However, a high batch size results in a latency penalty. So for more real-time oriented usages, lower batch sizes (as low as a single input) are usually used. In contrast, a batch with many (potentially tens of) input images might be required to achieve optimal throughput.
Refer to the Benchmark App sample, which allows measuring both latency and throughput.
When comparing the Inference Engine performance with the framework or another reference code, make sure that both versions are as similar as possible:

- If the reference framework offers `FP16` support, make sure to test the Inference Engine with `FP16` as well when comparing to it.

You need to build your performance conclusions on reproducible data. Do the performance measurements with a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, use an aggregated value of the execution time for the final projections.
Refer to the Inference Engine Samples for code examples of performance measurements. Almost every sample, except interactive demos, has the `-ni` option to specify the number of iterations.
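For a rough idea of such a measurement, here is a minimal timing sketch (assuming an already created `inferRequest`; see Request-Based API and “GetBlob” Idiom below):

```cpp
#include <chrono>
#include <inference_engine.hpp>

// Average the execution time over many invocations, excluding the first
// (warm-up) iteration, which is almost always significantly slower.
double averageLatencyMs(InferenceEngine::InferRequest &inferRequest, int niter) {
    using Clock = std::chrono::high_resolution_clock;
    inferRequest.Infer();  // warm-up run, not measured
    auto start = Clock::now();
    for (int i = 0; i < niter; i++)
        inferRequest.Infer();
    std::chrono::duration<double, std::milli> total = Clock::now() - start;
    return total.count() / niter;
}
```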
Network training is typically done on high-end data centers, using popular training frameworks like Caffe*, TensorFlow*, and MXNet*. The Model Optimizer converts the trained model from its original proprietary format to the IR that describes the topology. The IR is accompanied by a binary file with weights. These files in turn are consumed by the Inference Engine and used for scoring.
As described in the Model Optimizer Guide, there are a number of device-agnostic optimizations the tool performs. For example, certain primitives like linear operations (BatchNorm and ScaleShift) are automatically fused into convolutions. Generally, these layers should not appear in the resulting IR:
The picture above shows the Caffe* Resnet269* topology. The left model is the original one, and the model on the right is what the Model Optimizer produces after conversion, with the BatchNorm and ScaleShift layers fused into the convolution weights rather than constituting separate layers.
If you still see these operations, inspect the Model Optimizer output carefully, searching for warnings such as a notice that the tool was unable to fuse. For example, non-linear operations (like activations) in between convolutions and linear operations might prevent the fusing. If performance is a concern, try to change (and potentially re-train) the topology. Refer to the Model Optimizer Guide for more optimizations.
Notice that the activation (`_relu`) is not touched by the Model Optimizer. While it can be merged into the convolution as well, this is rather a device-specific optimization, covered by the Inference Engine at model loading time. You are encouraged to inspect the performance counters from the plugins, which should indicate that these particular layers are not executed (“Optimized out”). For more information, refer to Internal Inference Performance Counters.
Also:

- Use the pre-processing options (`--scale` and `--mean_values`) with the Model Optimizer when you need pre-processing. This allows the tool to bake the pre-processing into the IR so that it is accelerated by the Inference Engine.
- Use the `--reverse_input_channels` command-line option, so you do not need to convert your inputs to RGB every time you get a BGR image, for example, from OpenCV*.
- The precision of the resulting IR, `FP16` or `FP32`, directly affects the performance. Try the `FP16` precision for a GPU target device. Notice that this is the only precision that the Intel® Movidius™ Myriad™ 2 and Intel® Movidius™ Myriad™ X Visual Processing Units support.

The Inference Engine supports several target devices (CPU, GPU, Intel® Movidius™ Myriad™ 2 VPU, Intel® Movidius™ Myriad™ X VPU, and FPGA), and each of them has a corresponding plugin. If you want to optimize a specific device, keep in mind the following tips to increase performance.
The CPU plugin completely relies on the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) for the acceleration of major primitives, for example, Convolutions or FullyConnected.
The only hint you can get from the Inference Engine execution is how the major primitives are accelerated (and you cannot change this). For example, on Intel® Core™ machines, you should see variations of the `jit_avx2` kernels when inspecting the internal inference performance counters. If you are an advanced user, you can further trace the CPU execution.
Internally, the Inference Engine CPU plugin has a threading abstraction level, which allows the open-source version to be compiled with either Intel® Threading Building Blocks (Intel® TBB) or OpenMP* as the parallelism solution. Aligning the threading model with the rest of your application (and any third-party libraries that you use) is particularly important to avoid oversubscription. For more information, see the Note on the App-Level Threading section.
The OpenVINO™ toolkit currently comes pre-compiled with OpenMP*, so all general and Intel-specific OpenMP* environment settings like `OMP_NUM_THREADS` apply. To align the configuration capabilities of the Inference Engine compiled with TBB, use the general CPU configuration options rather than OpenMP-specific settings. Note that the OpenMP* settings also affect the rest of the program, beyond the Inference Engine itself.
Other general recommendations:
Unlike most accelerators, the CPU is perceived as an inherently latency-oriented device. With the 2018 R5 release, OpenVINO™ introduced the "throughput" mode for the CPU, which allows the Inference Engine to efficiently run multiple inference requests on the CPU simultaneously, greatly improving the throughput.
Internally, the execution resources are split/pinned into execution "streams". This feature provides much better performance for networks that originally do not scale well with the number of threads (for example, lightweight topologies). This is especially pronounced on many-core server machines.
Try the Benchmark App sample and play with the number of infer requests running in parallel. The rule of thumb is to try up to the number of CPU cores on your machine. For example, on an 8-core CPU, compare `-nireq 1` (which is a legacy scenario) with 2, 4, and 8 requests.
In addition to the number of inference requests in flight (and, hence, the number of CPU execution streams), you can also play with the batch size to find the best throughput.
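A hedged sketch of what the throughput mode looks like programmatically, assuming `plugin` and `network` objects created as in the Request-Based API and “GetBlob” Idiom section below (the key names follow `ie_plugin_config.hpp`, but check your release):

```cpp
// Enable the CPU "throughput" streams and create several infer requests
// that will be executed simultaneously.
std::map<std::string, std::string> config = {
    { InferenceEngine::PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS,
      InferenceEngine::PluginConfigParams::CPU_THROUGHPUT_AUTO }
};
auto executableNetwork = plugin.LoadNetwork(network, config);

std::vector<InferenceEngine::InferRequest> requests;
for (int i = 0; i < 8; i++)  // rule of thumb: up to the number of CPU cores
    requests.push_back(executableNetwork.CreateInferRequest());

for (auto &request : requests)
    request.StartAsync();    // requests are processed in parallel streams
for (auto &request : requests)
    request.Wait(InferenceEngine::IInferRequest::WaitMode::RESULT_READY);
```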
If your application is hard or impossible to change in accordance with the multiple-requests logic, consider the "multiple-instance" trick to improve the throughput:

- For multi-socket machines, set `OMP_NUM_THREADS` to the number of cores per socket, pre-pin the threads using `KMP_AFFINITY=granularity=fine,compact,1,0`, and run as many instances of the application as you have sockets.
- Try reducing the number of `OMP` threads down to just the number of physical (`#phys`) cores and further, while trying to saturate the machine with multiple running instances of the application.

The Inference Engine relies on the Compute Library for Deep Neural Networks (clDNN) for the acceleration of Convolutional Neural Networks on Intel® GPUs. Internally, clDNN uses OpenCL™ to implement the kernels. Thus, many general tips apply:
- Prefer `FP16` over `FP32`, as the Model Optimizer can generate both variants, and `FP32` is the default.

Since the Intel® Movidius™ Myriad™ VPUs (Myriad™ 2 and Myriad™ X) communicate with the host over USB, a minimum of two infer requests in flight is recommended to hide the data transfer costs. See Request-Based API and “GetBlob” Idiom and Image Classification Sample Async for more information.
Below are the most important tips for the efficient use of the FPGA:

- Since the first inference iteration is always significantly slower than the subsequent ones, make sure you run multiple iterations (every sample has the `-ni` option to do that).

Heterogeneous execution (provided by the dedicated Inference Engine "Hetero" plugin) enables scheduling a network inference across multiple devices.
The primary points for executing a network in heterogeneous mode are as follows:
The execution through the heterogeneous plugin has three distinct steps:
Performance benefits of the heterogeneous execution depend heavily on the communication granularity between devices. If transmitting/converting data from one device to another takes more time than the execution itself, the heterogeneous approach makes little or no sense. Using Intel® VTune™ helps to visualize the execution flow on a timeline (see Intel® VTune™ Examples).
Similarly, if there are too many subgraphs, the synchronization and data transfers might consume the entire performance benefit. In some cases, you can define the (coarser) affinity manually to avoid sending data back and forth many times during one inference.
The general affinity “rule of thumb” is to keep computationally-intensive kernels on the accelerator, and auxiliary, or secondary, kernels on the CPU. Notice that this includes granularity considerations. For example, running some custom activation (that comes after every accelerator-equipped convolution) on the CPU might result in performance degradation due to excessive data type and/or layout conversions, even though the activation itself can be extremely fast. In this case, it might make sense to implement the kernel for the accelerator (see Optimizing Custom Kernels). The conversions typically manifest themselves as outstanding (compared to the CPU-only execution) Reorders (see Internal Inference Performance Counters).
For general details on the heterogeneous plugin, refer to the corresponding section in the Inference Engine Developer Guide.
Every Inference Engine sample that supports the `-d` (device) option also accepts `HETERO` as a device. For example, here is a command to run the Object Detection SSD Sample with the FPGA-first fallback policy (a representative invocation; paths are placeholders): `./object_detection_sample_ssd -m <path_to_model>/model.xml -i <path_to_image>/image.jpg -d HETERO:FPGA,CPU`
where:

- `HETERO` stands for the heterogeneous plugin.
- `FPGA,CPU` points to the fallback policy, with the first priority on FPGA and further fallback to CPU.

You can point to more than two devices, for example: `-d HETERO:FPGA,GPU,CPU`.
As the FPGA is considered an inference accelerator, most performance issues are related to the fact that, due to the fallback, the CPU can still be used quite heavily. In this case, minimizing the busy-wait time that OpenMP threads spend spinning between parallel regions helps:
- Setting the `KMP_BLOCKTIME` environment variable to something less than the default 200ms (we suggest 1ms) is particularly helpful. Use `KMP_BLOCKTIME=0` if the CPU subgraph is small.
- Often, the subgraphs that fall back to the CPU are lightweight (for example, `SoftMax` in most classification models or `PriorBox` and `DetectionOutput` in the SSD*-based topologies). In that case, limiting the number of CPU threads with `OMP_NUM_THREADS` would further reduce the CPU utilization without significantly degrading the overall performance.

NOTE: General threading tips (see Note on the App-Level Threading) apply well, even when the entire topology fits the FPGA, because there is still host-side code for data pre- and post-processing.
The following tips are provided to give general guidance on optimizing execution on GPU/CPU devices.
- If possible, compare running the entire topology on a single device (`-d CPU` or `-d GPU`). If there are specific kernels that are not supported by the GPU, the best option to try is `HETERO:GPU,CPU`, which automatically applies the default splitting (based on the plugins' layers support). Then, you can play with the manual affinity setting (for example, to further minimize the number of subgraphs).
- Keep in mind that mixing the `FP16` (GPU) and `FP32` (CPU) execution results in conversions and, thus, performance issues. If you are seeing a lot of heavy outstanding (compared to the CPU-only execution) Reorders, consider implementing actual GPU kernels. Refer to Internal Inference Performance Counters for more information.

There is a dedicated configuration option that enables dumping a visualization of the subgraphs created by the heterogeneous plugin, as sketched below:
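A hedged sketch of enabling the dump, assuming the `KEY_HETERO_DUMP_GRAPH_DOT` key from `hetero/hetero_plugin_config.hpp` and the plugin-based loading described in Request-Based API and “GetBlob” Idiom (check your release for the exact names):

```cpp
#include <hetero/hetero_plugin_config.hpp>
#include <ie_plugin_config.hpp>

// Ask the "Hetero" plugin to dump .dot visualizations of the subgraphs
// it creates while loading the network.
auto executableNetwork = heteroPlugin.LoadNetwork(network,
    {{ InferenceEngine::HeteroConfigParams::KEY_HETERO_DUMP_GRAPH_DOT,
       InferenceEngine::PluginConfigParams::YES }});
```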
After enabling the configuration key, the heterogeneous plugin generates two files:
- `hetero_affinity.dot` - per-layer affinities. This file is generated only if the default fallback policy was executed (otherwise you have set the affinities yourself, so you know them).
- `hetero_subgraphs.dot` - affinities per subgraph. This file is written to disk during the execution of `ICNNNetwork::LoadNetwork` for the heterogeneous plugin.

You can use the GraphViz* utility or `*.dot` converters (for example, to `.png` or `.pdf`), like xdot*, available on Linux* OS with `sudo apt-get install xdot`. Below is an example of the output trimmed to the last two layers (one executed on the FPGA and another on the CPU):
You can also use the performance counters (in samples, enabled with the `-pc` option) to get per-subgraph performance data. Refer to Internal Inference Performance Counters for more information.
Today, the Inference Engine supports only CPU and GPU custom kernels. Typically, custom kernels are used to quickly implement missing layers for new topologies. You should not override the implementation of standard layers, especially on the critical path, for example, Convolutions. Also, overriding existing layers can disable some existing performance optimizations, such as fusing.
It is usually easier to start with the CPU extension and switch to the GPU after debugging with the CPU path. Sometimes, when the custom layers are at the very end of your pipeline, it is easier to implement them as regular post-processing in your application, without wrapping them as kernels. This is particularly true for kernels that do not fit the GPU well, for example, the sorting of output bounding boxes. In many cases, you can do such post-processing on the CPU.
There are many cases when a sequence of custom kernels can be implemented as one “super” kernel, saving on data accesses.
Finally, with the heterogeneous execution, it is possible to execute the vast majority of intensive computations with the accelerator and keep the custom pieces on the CPU. The tradeoff is granularity/costs of communication between different devices.
For more details on the API of the custom layers, see Custom Layers Support in Inference Engine.
In most cases, before actually implementing a full-blown code for the kernel, you can estimate the final performance by doing a simple stub kernel that does nothing (and thus is "infinitely" fast) just to let the topology execute end-to-end. Of course, the estimation is valid only if the kernel output does not affect the performance, for instance, if its output is not driving any branches or loops.
Other than that, when implementing the kernels, you can try the methods from the previous chapter to understand the actual contribution and, if any custom kernel is among the hotspots, optimize it.
A few other tips:

- See the example of the CPU extensions in `<OPENVINO_INSTALL_DIR>/deployment_tools/samples/extension/`.
- Upon the creation of an `InferenceEngine::ExecutableNetwork`, the CPU plugin creates as many threads as the number of cores on your machine. This can easily lead to oversubscription within the Inference Engine (see Performance Aspects of Running Multiple CPU Requests Simultaneously). There are also other threads in your application, so oversubscription is possible at the application level as well (for example, when an alternative threading runtime is substituted via `LD_PRELOAD` on Linux* OS).

In many cases, a network expects a pre-processed image, so make sure you do not perform unnecessary steps in your code:
- Do not convert the input image to `FP32` on your side, as this is something that the plugins can accelerate; use `InferenceEngine::Precision::U8` as your input format instead (see the sketch below).

Notice that in many cases, you can directly share the (input) data with the Inference Engine.
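A minimal sketch of setting the input precision (the `network` object is assumed to be an already parsed `CNNNetwork`):

```cpp
// Request U8 inputs: the plugin converts to the execution precision
// internally, which it can do faster than generic user-side code.
InferenceEngine::InputsDataMap inputsInfo = network.getInputsInfo();
for (auto &item : inputsInfo)
    item.second->setPrecision(InferenceEngine::Precision::U8);
```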
The general approach for sharing data between Inference Engine and media/graphics APIs like Intel® Media Server Studio (Intel® MSS) is based on sharing the system memory. That is, in your code, you should map or copy the data from the API to the CPU address space first.
For Intel MSS, it is recommended to perform a viable pre-processing, for example, crop/resize, and then convert to RGB again with the Video Processing Procedures (VPP). Then lock the result and create an Inference Engine blob on top of it. The resulting pointer can be used with `SetBlob`:
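A hedged sketch of such wrapping (the locked surface pointer `pSysMem`, the `height`/`width` values, and the input name `"input"` are illustrative):

```cpp
// Wrap the locked, interleaved RGB output of the VPP as a U8 blob.
// The pointer is shared with the Inference Engine; no copy happens here.
InferenceEngine::TensorDesc desc(InferenceEngine::Precision::U8,
                                 {1, 3, height, width},  // dims in NCHW order
                                 InferenceEngine::Layout::NHWC);
auto inputBlob = InferenceEngine::make_shared_blob<uint8_t>(desc, pSysMem);
inferRequest.SetBlob("input", inputBlob);
```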
WARNING: The `InferenceEngine::NHWC` layout is natively supported only by the Intel® Movidius™ Myriad™ 2 VPU device today; for the rest of the plugins, an internal conversion might happen.
Alternatively, you can use the RGBP (planar RGB) output from Intel MSS. This allows wrapping the (locked) result as a regular NCHW blob, which is generally friendly for most plugins (unlike NHWC). Then you can use it with `SetBlob`, just like in the previous example:
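A variant of the previous sketch for the planar case (again, `pRGBP` and the dimensions are illustrative):

```cpp
// The planar RGBP surface maps directly to the NCHW layout, so no
// layout conversion is needed by most plugins.
InferenceEngine::TensorDesc desc(InferenceEngine::Precision::U8,
                                 {1, 3, height, width},
                                 InferenceEngine::Layout::NCHW);
inferRequest.SetBlob("input",
                     InferenceEngine::make_shared_blob<uint8_t>(desc, pRGBP));
```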
The only downside of this approach is that the VPP conversion to RGBP is not hardware-accelerated (it is performed on the GPU EUs).
Unlike APIs that use dedicated address spaces and/or special data layouts (for instance, compressed OpenGL* textures), regular OpenCV data objects like `cv::Mat` reside in the conventional system memory. That is, the memory can actually be shared with the Inference Engine, with only the data ownership being transferred.
Again, if the OpenCV and Inference Engine layouts match, the data can be wrapped as an Inference Engine (input/output) blob. Notice that by default, the Inference Engine accepts planar and not interleaved inputs in NCHW, so the NHWC layout (which is exactly the interleaved one) must be specified explicitly:
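A hedged sketch of such zero-copy wrapping (the helper name `wrapMat` is illustrative):

```cpp
#include <opencv2/core.hpp>
#include <inference_engine.hpp>

// Share an 8-bit, interleaved (HWC) cv::Mat with the Inference Engine
// by declaring the blob layout as NHWC; the data is not copied.
InferenceEngine::Blob::Ptr wrapMat(cv::Mat &frame) {
    InferenceEngine::TensorDesc desc(
        InferenceEngine::Precision::U8,
        {1, static_cast<size_t>(frame.channels()),
         static_cast<size_t>(frame.rows), static_cast<size_t>(frame.cols)},
        InferenceEngine::Layout::NHWC);  // interleaved, matches cv::Mat
    return InferenceEngine::make_shared_blob<uint8_t>(desc, frame.data);
}
```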
WARNING: The `InferenceEngine::NHWC` layout is natively supported only by the Intel® Movidius™ Myriad™ 2 VPU device today; for the rest of the plugins, an internal conversion might happen.
Notice that the original `cv::Mat`/blobs cannot be used simultaneously by the application and the Inference Engine. Alternatively, the data that the pointer references can be copied to unlock the original data and return the ownership to the original API.
The Infer Request based API offers two types of request: Sync and Async. The Sync one is considered below. The Async splits the (synchronous) `Infer` into `StartAsync` and `Wait` (see Inference Engine Async API).
More importantly, an infer request encapsulates the reference to the “executable” network and the actual inputs/outputs. When you load the network to the plugin, you get a reference to the executable network (you may consider it a queue). Actual infer requests are created by the executable network:
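A minimal sketch, assuming the 2018-era plugin API (newer releases expose the same flow through `InferenceEngine::Core`) and an already parsed `network`:

```cpp
// Load the network to the plugin: the result is an "executable" network,
// which you may view as a queue that produces the actual infer requests.
InferenceEngine::InferencePlugin plugin =
    InferenceEngine::PluginDispatcher({""}).getPluginByDevice("CPU");
InferenceEngine::ExecutableNetwork executableNetwork =
    plugin.LoadNetwork(network, {} /* plugin config */);
InferenceEngine::InferRequest inferRequest =
    executableNetwork.CreateInferRequest();
```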
`GetBlob` is the recommended way to communicate with the network, as it internally allocates the data with the right padding/alignment for the device. For example, the GPU inputs/outputs blobs are mapped to the host (which is fast) if `GetBlob` is used. But if you call `SetBlob`, a copy (from/to the blob you set) into the internal GPU plugin structures will happen.
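A short sketch of the recommended `GetBlob` flow (the input/output names are hypothetical):

```cpp
// Let the plugin allocate the input: the buffer already has the
// device-friendly padding/alignment.
auto inputBlob = inferRequest.GetBlob("input");
uint8_t *inputData = inputBlob->buffer().as<uint8_t *>();
// ... fill inputData with the pre-processed image bytes ...
inferRequest.Infer();
auto outputBlob = inferRequest.GetBlob("output");
const float *scores = outputBlob->buffer().as<float *>();
```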
By default, the Inference Engine is compiled with OpenMP as the parallelism solution. The default OpenMP behavior is that every new thread that calls the CPU plugin implicitly spawns a full set of OpenMP threads (as this new thread becomes an OpenMP master for them). This happens for every `InferenceEngine::ExecutableNetwork`, as every executable network gets its own dedicated thread automatically.
If your application simultaneously executes infer requests from multiple threads (or networks) on the CPU, make sure you do not oversubscribe the machine:

- Limit the number of threads per network with the `KEY_CPU_THREADS_NUM` configuration option. For example, if you run two requests in parallel, try `#cores/2` threads for each. Similarly, if you run four requests in parallel (for instance, four independent video streams), try `KEY_CPU_THREADS_NUM` with `#cores/4`. The downside is that this is an inherently static approach.
- Another option is the `EXCLUSIVE_ASYNC_REQUESTS` configuration option that limits the number of the simultaneously executed requests, for all (executable) networks that share the specific device, to just one, as sketched below.
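A hedged sketch of this option, assuming the plugin-based loading shown earlier (both networks go through the same plugin, that is, the same device, so their requests will not overlap):

```cpp
// The exclusive mode serializes requests of all networks on this device.
auto executableNetwork0 = plugin.LoadNetwork(network0,
    {{ InferenceEngine::PluginConfigParams::KEY_EXCLUSIVE_ASYNC_REQUESTS,
       InferenceEngine::PluginConfigParams::YES }});
auto executableNetwork1 = plugin.LoadNetwork(network1, {});
```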
The heterogeneous device uses `EXCLUSIVE_ASYNC_REQUESTS` by default.

NOTE: The `KEY_EXCLUSIVE_ASYNC_REQUESTS` option affects only the device queues of the individual application.

The Inference Engine Async API can improve the overall frame rate of the application. While the accelerator is busy with inference, the application can continue doing things on the host rather than wait for the inference to complete.
In the example below, inference is applied to the results of the video decoding. So it is possible to keep two parallel infer requests: while the current one is processed, the input frame for the next one is being captured. This essentially hides the latency of capturing, so that the overall frame rate is determined only by the slowest part of the pipeline (decoding or inference) and not by the sum of the stages.
You can compare the pseudo-code for the regular and async-based approaches:

- In the regular way, the frame is captured and then immediately processed, so the capture and inference latencies add up.
- In the "true" async way, the `NEXT` request is populated in the main (application) thread while the `CURRENT` request is processed (see the sketch at the end of this section).

There are important performance caveats though: for example, the tasks that run in parallel should try to avoid oversubscribing the shared compute resources. If the inference is performed on the FPGA and the CPU is essentially idle, it makes sense to do things on the CPU in parallel. However, multiple infer requests can oversubscribe it. Notice that heterogeneous execution can implicitly use the CPU; refer to the Heterogeneity section.
Also, if the inference is performed on the graphics processing unit (GPU), there is little gain in doing the encoding, for instance, of the resulting video, on the same GPU in parallel, because the device is already busy.
Refer to the Object Detection SSD Demo (latency-oriented Async API showcase) and Image Classification Sample Async (throughput-oriented) for complete examples of the Async API in action.
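For illustration, here is a hedged pseudo-code sketch of the two approaches (helpers like `populateInferRequest` and `renderResults` are hypothetical):

```cpp
// Regular (synchronous) way: capture and inference latencies add up.
while (capture.read(frame)) {
    populateInferRequest(currentRequest, frame);  // copy frame to input blob
    currentRequest.Infer();                       // blocks until completion
    renderResults(currentRequest);
}

// "True" async way: the NEXT frame is captured and its request populated
// while the CURRENT request is processed by the accelerator.
capture.read(frame);
populateInferRequest(currentRequest, frame);
currentRequest.StartAsync();
while (capture.read(nextFrame)) {
    populateInferRequest(nextRequest, nextFrame);  // overlaps with inference
    currentRequest.Wait(
        InferenceEngine::IInferRequest::WaitMode::RESULT_READY);
    renderResults(currentRequest);
    nextRequest.StartAsync();
    std::swap(currentRequest, nextRequest);
    std::swap(frame, nextFrame);
}
```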
Whether you are tuning for the first time or doing advanced performance optimization, you need a tool that provides accurate insights. Intel® VTune™ Amplifier gives you the tools to collect and interpret the profiling data.
Alternatively, you can gather the raw profiling data that the samples report; the Internal Inference Performance Counters section below provides an example of how to interpret it.
All major performance calls of the Inference Engine are instrumented with Instrumentation and Tracing Technology APIs. This allows viewing the Inference Engine calls on the Intel® VTune™ timelines and aggregations, plus correlating them to the underlying APIs, like OpenCL. In turn, this enables a careful per-layer execution breakdown.
When choosing the Analysis type in Intel® VTune™ Amplifier, make sure to select the Analyze user tasks, events, and counters option:
See the corresponding section in the Intel® VTune™ Amplifier 2018 User's Guide for details.
Example of Inference Engine calls:

- On the Intel VTune Amplifier timeline. Notice that `Task_runNOThrow` is an Async API wrapper; it is executed in a different thread and triggers the Intel MKL-DNN execution.
- In the Intel VTune Amplifier Top-down view, grouped by the Task Domain. Notice the `Task_runNOThrow` and `MKLDNN_INFER` that are bracketing the actual Intel MKL-DNN kernels execution.
Similarly, you can use any GPU analysis in the Intel VTune Amplifier and get general correlation with Inference Engine API as well as the execution breakdown for OpenCL kernels.
Just like with a regular native application, further drill-down in the counters is possible; however, this is mostly useful for optimizing custom kernels. Finally, with the Intel VTune Amplifier, the profiling is not limited to your user-level code (see the corresponding section in the Intel® VTune™ Amplifier 2018 User's Guide).
Almost every sample (inspect the command-line options of a specific sample with `-h`) supports the `-pc` option that outputs an internal execution breakdown. Refer to the samples code for the actual Inference Engine API behind it.
Below is an example of the CPU plugin output for a network (since the device is CPU, the layers' wall-clock `realTime` and the `cpu` time are the same):
This contains the layers name (as seen in the IR), layers type, and execution statistics. Notice the `OPTIMIZED_OUT`, which indicates that the particular activation was fused into an adjacent convolution. Also, the `unknown` stands for the Inference Engine specific CPU (helper) primitives that are not part of the Intel MKL-DNN.
Notice that there are some helper layers in the CPU execution breakdown that were not present in the original topology. These are automatically added by the plugin. For example, the `Reorder` re-packs the Intel MKL-DNN internal (blocked) layout into the regular plain NCHW (that the user expects as the output). As explained in the Few Device-Specific Tips, if your custom kernels introduce a lot of outstanding/expensive Reorders, consider a blocked implementation for the kernels.
Notice that in the heterogeneous cases, there will be additional information on which subgraph the statistics are for (here, the first subgraph is on the GPU, so its `cpu`/host time is really small compared to the actual `realTime`):
As mentioned earlier, `unknown` here means a CPU kernel with an unknown (for example, not AVX2 or AVX512) acceleration path. Since the FPGA execution does not separate individual kernels, only bulk execution/data transfer statistics are available:

The `softmax/copy` is a glue layer that connects the FPGA subgraph to the CPU subgraph (and copies the data).