Heterogeneous Plugin

Introducing Heterogeneous Plugin

The heterogeneous plugin enables inference of a single network on several devices. Purposes of executing networks in heterogeneous mode:

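Execution through the heterogeneous plugin can be divided into two independent steps: setting the affinity of layers to devices, and loading the network into the plugin, which splits it into parts and executes them.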

These steps are decoupled. Affinity can be set automatically, using the fallback policy, or manually.

The automatic fallback policy is greedy: following the device priorities, it assigns every layer that a device can execute to that device.
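The greedy behavior can be illustrated with a small standalone sketch. It does not use the Inference Engine API: `ToyLayer`, `Capabilities`, and `assignGreedy` are hypothetical names introduced only for this illustration, and the capability table stands in for whatever each real plugin reports about layer support.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy layer description; this is NOT the Inference Engine CNNLayer class.
struct ToyLayer {
    std::string name;
    std::string type;
    std::string affinity;  // stays empty until a device is assigned
};

// Hypothetical capability table: device name -> layer types it can execute.
using Capabilities = std::map<std::string, std::set<std::string>>;

// Greedy fallback: every layer goes to the first device, in priority order,
// that supports its type. A layer supported by no device keeps an empty affinity.
void assignGreedy(std::vector<ToyLayer>& layers,
                  const std::vector<std::string>& priorities,
                  const Capabilities& caps) {
    for (auto& layer : layers) {
        for (const auto& device : priorities) {
            auto it = caps.find(device);
            if (it != caps.end() && it->second.count(layer.type) != 0) {
                layer.affinity = device;
                break;  // greedy: stop at the highest-priority match
            }
        }
    }
}
```

With priorities `{"FPGA", "CPU"}`, a layer type that both devices support lands on FPGA, and a type that only CPU supports falls back to CPU.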

Some topologies are not well suited to heterogeneous execution on some devices, or cannot be executed in this mode at all. One example is a network with activation layers that are not supported on the primary device. If transferring data from one part of the network to another takes a relatively long time, it makes little sense to execute the network in heterogeneous mode on these devices. In this case, you can identify the heaviest part manually and set the affinity so that data is not sent back and forth many times during one inference.
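One rough way to compare candidate affinity assignments is to count how often data would cross a device boundary. The sketch below assumes the layers form a simple linear chain, which real networks often do not, and `countDeviceTransitions` is a hypothetical helper, not part of the Inference Engine.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Counts how many times data crosses a device boundary when the layers of a
// linear chain run in order with the given per-layer affinities.
std::size_t countDeviceTransitions(const std::vector<std::string>& affinities) {
    std::size_t transitions = 0;
    for (std::size_t i = 1; i < affinities.size(); ++i) {
        if (affinities[i] != affinities[i - 1]) {
            ++transitions;
        }
    }
    return transitions;
}
```

For example, the chain FPGA, CPU, FPGA, CPU, FPGA has four transitions, while FPGA, FPGA, FPGA, CPU, CPU has only one. The count ignores how much data each transfer moves, so treat it as a first indicator only.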

Annotation of Layers per Device and Default Fallback Policy

The default fallback policy automatically decides which layer goes to which device, according to the support in the dedicated plugins (FPGA, GPU, CPU, VPU).

Another way to annotate a network is to set the affinity manually using the CNNLayer::affinity field. This field accepts device names as string values, such as "CPU" or "FPGA".

The fallback policy does not work if even one layer has an initialized affinity. The correct sequence is to call the automatic affinity setting first and then adjust individual layers manually.

// This example demonstrates how to do default affinity initialization and then
// correct the affinity manually for some layers
InferenceEngine::PluginDispatcher dispatcher({ FLAGS_pp, archPath , "" });
InferenceEngine::InferenceEnginePluginPtr enginePtr = dispatcher.getPluginByDevice("HETERO:FPGA,CPU");
HeteroPluginPtr hetero(enginePtr);
InferenceEngine::ResponseDesc resp;  // receives the error description if SetAffinity fails
hetero->SetAffinity(network, { }, &resp);
// "qqq" stands for the name of the layer whose affinity you want to change
network.getLayerByName("qqq")->affinity = "CPU";
InferencePlugin plugin(enginePtr);
auto executable_network = plugin.LoadNetwork(network, {});

If you rely on the default affinity distribution, you can skip the call to IHeteroInferencePlugin::SetAffinity and just call InferencePlugin::LoadNetwork:

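InferenceEngine::PluginDispatcher dispatcher({ FLAGS_pp, archPath , "" });
InferenceEngine::InferenceEnginePluginPtr enginePtr = dispatcher.getPluginByDevice("HETERO:FPGA,CPU");
InferencePlugin plugin(enginePtr);
CNNNetReader reader;
// Read the model into `reader` here; `network` is then the result of reader.getNetwork()
auto executable_network = plugin.LoadNetwork(network, {});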

Details of Splitting Network and Execution

When a network is loaded into the heterogeneous plugin, it is divided into separate parts, and each part is loaded into its dedicated plugin. Intermediate blobs between these subgraphs are allocated automatically in the most efficient way.
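The splitting step can be pictured with a simplified sketch. It assumes a linear chain of layers whose affinities are already set, and it groups consecutive layers that share a device into one subgraph. The plugin itself works on general graphs, and `ToyLayer`, `ToySubgraph`, and `splitByAffinity` are hypothetical names used only here.

```cpp
#include <string>
#include <vector>

// Toy types for illustration; these are NOT Inference Engine classes.
struct ToyLayer {
    std::string name;
    std::string affinity;
};

struct ToySubgraph {
    std::string device;
    std::vector<std::string> layerNames;
};

// Groups consecutive layers with the same affinity into one subgraph.
// A new subgraph starts every time the device changes along the chain.
std::vector<ToySubgraph> splitByAffinity(const std::vector<ToyLayer>& chain) {
    std::vector<ToySubgraph> subgraphs;
    for (const auto& layer : chain) {
        if (subgraphs.empty() || subgraphs.back().device != layer.affinity) {
            subgraphs.push_back({layer.affinity, {}});
        }
        subgraphs.back().layerNames.push_back(layer.name);
    }
    return subgraphs;
}
```

Each boundary between two subgraphs corresponds to an intermediate blob that has to be passed from one device to the next.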

Execution Precision

Precision for inference in the heterogeneous plugin is defined by


Samples can be used with the following command:

./object_detection_sample_ssd -m <path_to_model>/ModelSSD.xml -i <path_to_pictures>/picture.jpg -d HETERO:FPGA,CPU


You can specify more than two devices: -d HETERO:FPGA,GPU,CPU
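The device string has a simple shape: the HETERO: prefix followed by a comma-separated device list. The sketch below shows how such a string could be split into a list. It assumes the order of the list is the fallback priority order, and `parseHeteroPriorities` is a hypothetical helper that does not reflect how the plugin parses the string internally.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits "HETERO:FPGA,GPU,CPU" into {"FPGA", "GPU", "CPU"}.
// Returns an empty list when the "HETERO:" prefix is missing.
std::vector<std::string> parseHeteroPriorities(const std::string& deviceString) {
    const std::string prefix = "HETERO:";
    std::vector<std::string> devices;
    if (deviceString.compare(0, prefix.size(), prefix) != 0) {
        return devices;
    }
    std::istringstream rest(deviceString.substr(prefix.size()));
    std::string device;
    while (std::getline(rest, device, ',')) {
        if (!device.empty()) {  // skip empty entries such as "FPGA,,CPU"
            devices.push_back(device);
        }
    }
    return devices;
}
```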

Analyzing Heterogeneous Execution

After enabling the KEY_HETERO_DUMP_GRAPH_DOT config key, you can dump GraphViz* .dot files annotated with the device assigned to each layer.

The heterogeneous plugin can generate two files:

enginePtr = dispatcher.getPluginByDevice("HETERO:FPGA,CPU");
InferencePlugin plugin(enginePtr);
plugin.SetConfig({ {KEY_HETERO_DUMP_GRAPH_DOT, YES} });

You can view these files with the GraphViz* utility or convert them to .png format. On the Ubuntu* operating system, you can use the following utilities:

Besides generating .dot files, you can use the error listening mechanism:

class FPGA_ErrorListener : public InferenceEngine::IErrorListener
{
public:
    // Prints every message reported by the plugin
    virtual void onError(const char *msg) noexcept override {
        std::cout << msg;
    }
};
FPGA_ErrorListener err_listener;

If, during network loading, some layers are assigned to a fallback plugin, a message like the following is printed:

Layer (Name: detection_out, Type: DetectionOutput) is not supported:
custom or unknown.
Has (3) sets of inputs, must be 1, or 2.
Input dimensions (2) should be 4.

You can use performance data (the -pc option in the samples) to get performance figures for each subgraph.

Here is an example of the output for GoogLeNet v1 running on FPGA with fallback to CPU:

subgraph1: 1. input preprocessing (mean data/FPGA):EXECUTED layerType: realTime: 129 cpu: 129 execType:
subgraph1: 2. input transfer to DDR:EXECUTED layerType: realTime: 201 cpu: 0 execType:
subgraph1: 3. FPGA execute time:EXECUTED layerType: realTime: 3808 cpu: 0 execType:
subgraph1: 4. output transfer from DDR:EXECUTED layerType: realTime: 55 cpu: 0 execType:
subgraph1: 5. FPGA output postprocessing:EXECUTED layerType: realTime: 7 cpu: 7 execType:
subgraph1: 6. copy to IE blob:EXECUTED layerType: realTime: 2 cpu: 2 execType:
subgraph2: out_prob: NOT_RUN layerType: Output realTime: 0 cpu: 0 execType: unknown
subgraph2: prob: EXECUTED layerType: SoftMax realTime: 10 cpu: 10 execType: ref
Total time: 4212 microseconds
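The total in this report is the sum of the realTime values: 129 + 201 + 3808 + 55 + 7 + 2 + 0 + 10 = 4212 microseconds, of which the FPGA execute time (3808) is about 90%. A minimal sketch for checking such a report programmatically is shown below. `sumRealTime` is a hypothetical helper, and it assumes every value follows a standalone `realTime:` token, as in the output above.

```cpp
#include <sstream>
#include <string>

// Sums the integer that follows each "realTime:" token in a performance report.
long sumRealTime(const std::string& report) {
    std::istringstream in(report);
    std::string token;
    long total = 0;
    while (in >> token) {
        if (token == "realTime:") {
            long value = 0;
            if (in >> value) {
                total += value;
            }
        }
    }
    return total;
}
```

Comparing this sum with the reported total is a quick sanity check that no subgraph entry was dropped when the report was copied or filtered.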

See Also