GNA Plugin

Introducing the GNA Plugin

Intel® Gaussian & Neural Accelerator is a low-power neural co-processor for continuous inference "at the edge".

Intel® GNA is not intended to replace "classic" inference devices such as CPU, GPU or VPU. It is designed for offloading continuous inference workloads, including (but not limited to) noise reduction and speech recognition, to save power and free CPU resources.

The GNA plugin provides a way to run inference on Intel® GNA, as well as in software execution mode on CPU.

Devices with Intel® GNA

The list of devices with Intel® GNA support includes:

Intel® Speech Enabling Developer Kit

Amazon Alexa* Premium Far-Field Developer Kit

Gemini Lake: Intel® Pentium® Silver J5005 Processor, Intel® Pentium® Silver N5000 Processor, Intel® Celeron® J4005 Processor, Intel® Celeron® J4105 Processor, Intel® Celeron® Processor N4100, Intel® Celeron® Processor N4000

Cannon Lake: Intel® Core™ i3-8121U Processor

Ice Lake: Intel® Core™ i7-1065G7 Processor, Intel® Core™ i7-1060G7 Processor, Intel® Core™ i5-1035G4 Processor, Intel® Core™ i5-1035G7 Processor, Intel® Core™ i5-1035G1 Processor, Intel® Core™ i5-1030G7 Processor, Intel® Core™ i5-1030G4 Processor, Intel® Core™ i3-1005G1 Processor, Intel® Core™ i3-1000G1 Processor, Intel® Core™ i3-1000G4 Processor

NOTE: On platforms where Intel® GNA is not enabled in the BIOS, the driver cannot be installed, so the GNA Plugin uses software emulation mode only.

Drivers and Dependencies

Intel® GNA hardware requires a driver to be installed in the system.

Intel® GNA driver for Ubuntu Linux 18.04.3 LTS (with HWE Kernel version 5.0+) can be downloaded here.

Intel® GNA driver for Windows is available through Windows Update.

NOTE: For the GNA plugin to function on Linux, the redistributable libraries for Intel® C++ compilers must be installed. See the installation instructions here.

Models and Layers Limitations

Because of the specifics of its hardware architecture, Intel® GNA supports a limited set of layers, layer kinds, and layer combinations. For example, you should not expect the GNA Plugin to run computer vision models, except those specifically adapted for the GNA Plugin, because the plugin does not fully support 2D convolutions.

The list of supported layers can be found here (see the GNA column of Supported Layers section). Limitations include:

Experimental Support for 2D Convolutions

The Intel® GNA hardware natively supports only 1D convolution.

However, it is possible to map 2D convolutions to 1D when the convolution kernel moves in only one direction. The GNA Plugin performs this transformation for Kaldi nnet1 convolution. From this perspective, the Intel® GNA hardware convolution operation accepts NHWC input and produces NHWC output. Since OpenVINO™ only supports NCHW layout, it may be necessary to insert Permute layers before/after convolutions.

For example, the Kaldi model optimizer inserts such a permute after convolution for the rm_cnn4a network. This Permute layer is automatically removed by the GNA Plugin since the Intel® GNA hardware convolution layer already produces the required NHWC result.
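To make the layout difference concrete, the following standalone sketch reorders a dense tensor from NCHW to NHWC, which is the kind of element reordering a Permute layer performs. The function name `nchw_to_nhwc` is hypothetical and is not part of the Inference Engine API or of the GNA Plugin; it only illustrates the index arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an Inference Engine API): copies a dense tensor
// stored in NCHW order into a new buffer stored in NHWC order.
std::vector<float> nchw_to_nhwc(const std::vector<float>& src,
                                std::size_t n, std::size_t c,
                                std::size_t h, std::size_t w) {
    assert(src.size() == n * c * h * w);
    std::vector<float> dst(src.size());
    for (std::size_t in = 0; in < n; ++in)
        for (std::size_t ic = 0; ic < c; ++ic)
            for (std::size_t ih = 0; ih < h; ++ih)
                for (std::size_t iw = 0; iw < w; ++iw) {
                    // NCHW: channel is the second-slowest dimension.
                    const std::size_t src_idx = ((in * c + ic) * h + ih) * w + iw;
                    // NHWC: channel is the fastest-varying dimension.
                    const std::size_t dst_idx = ((in * h + ih) * w + iw) * c + ic;
                    dst[dst_idx] = src[src_idx];
                }
    return dst;
}
```

For a tensor with two channels and three positions along W, the two channel rows are interleaved so that both channel values for each position become adjacent in memory.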

Operation Precision

Intel® GNA essentially operates in low-precision mode, which represents a mix of 8-bit (I8), 16-bit (I16), and 32-bit (I32) integer computations. As a result, outputs calculated using reduced integer precision differ from scores calculated using 32-bit floating point (FP32), for example, on CPU using the Inference Engine CPU Plugin.

Unlike other plugins supporting low-precision execution, the GNA Plugin calculates quantization factors at model loading time, so no calibration is needed to run a model.
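The effect of reduced integer precision can be illustrated with a generic scale-factor quantization sketch. This is an assumption-laden illustration of the general technique (multiply by a scale factor, round, saturate to the 16-bit range), not the GNA Plugin's actual quantization algorithm; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Illustrative only: quantize one FP32 value to a 16-bit integer using a
// scale factor, saturating at the int16 range. Not the plugin's own code.
std::int16_t quantize_i16(float value, float scale_factor) {
    assert(scale_factor > 0.0f);
    const float scaled = std::round(value * scale_factor);
    const float lo = static_cast<float>(std::numeric_limits<std::int16_t>::min());
    const float hi = static_cast<float>(std::numeric_limits<std::int16_t>::max());
    if (scaled < lo) return std::numeric_limits<std::int16_t>::min();
    if (scaled > hi) return std::numeric_limits<std::int16_t>::max();
    return static_cast<std::int16_t>(scaled);
}

// Map a quantized value back to FP32. The difference between this result
// and the original value is the quantization error.
float dequantize_i16(std::int16_t q, float scale_factor) {
    return static_cast<float>(q) / scale_factor;
}
```

Values that exceed the representable range after scaling saturate, which is one reason integer results can differ from FP32 results.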

Execution Modes

| Mode | Description |
| --- | --- |
| GNA_AUTO | Uses Intel® GNA if it is available and software execution mode on CPU otherwise |
| GNA_HW | Uses Intel® GNA. Returns error if Intel® GNA is not available |
| GNA_SW | Deprecated mode. Executes the GNA-compiled graph on CPU performing calculations in the same precision as the Intel® GNA does, but not in bit-exact mode |
| GNA_SW_EXACT | Executes the GNA-compiled graph on CPU performing calculations in the same precision as the Intel® GNA does in bit-exact mode |
| GNA_SW_FP32 | Executes the GNA-compiled graph on CPU but substitutes parameters and calculations from low precision to floating point (FP32) |

Supported Configuration Parameters

The plugin supports the configuration parameters listed below. The parameters are passed as a std::map<std::string, std::string> to InferenceEngine::Core::LoadNetwork.

| Parameter Name | Parameter Values | Default | Description |
| --- | --- | --- | --- |
| GNA_COMPACT_MODE | YES/NO | YES | Reuse I/O buffers to save space (makes debugging harder) |
| GNA_SCALE_FACTOR | FP32 number | 1.0 | Scale factor to use for input quantization |
| KEY_GNA_FIRMWARE_MODEL_IMAGE | std::string | "" | Name for embedded model binary dump file |
| KEY_GNA_PRECISION | I16/I8 | I16 | Hint to GNA plugin: preferred integer weight resolution for quantization |
| KEY_PERF_COUNT | YES/NO | NO | Turn on performance counters reporting |
| KEY_GNA_LIB_N_THREADS | 1-127 integer number | 1 | Sets the number of GNA accelerator library worker threads used for inference computation in software modes |
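As a minimal sketch of how such a configuration might be assembled, the helper below builds a std::map<std::string, std::string> with two of the parameters from the table above. The helper name `make_gna_config` is hypothetical, and using the literal strings "GNA_SCALE_FACTOR" and "GNA_COMPACT_MODE" as keys is an assumption based on how they are spelled in the table; check the configuration header of your Inference Engine version for the exact key definitions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: builds a configuration map of the type described
// above. Key strings are taken from the parameter table (an assumption).
std::map<std::string, std::string> make_gna_config(float scale_factor,
                                                   bool compact_mode) {
    assert(scale_factor > 0.0f);
    return {
        {"GNA_SCALE_FACTOR", std::to_string(scale_factor)},
        {"GNA_COMPACT_MODE", compact_mode ? "YES" : "NO"},
    };
}
```

The resulting map would then be supplied as the configuration argument when loading the network, for example `make_gna_config(2048.0f, false)` to set a scale factor of 2048 and disable buffer reuse for easier debugging.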

How to Interpret Performance Counters

When you collect performance counters using InferenceEngine::IInferencePlugin::GetPerformanceCounts, you get various performance data about execution on GNA. The returned map uses a counter description as the key; the counter value is stored in the realTime_uSec field of the InferenceEngineProfileInfo structure. The current GNA implementation calculates counters for whole-utterance scoring and does not provide per-layer information. The API reports counter values in cycles, which can be converted to seconds as follows:

seconds = cycles / frequency

Intel® Ice Lake processors and the Intel® Core™ i3-8121U processor include Intel® GNA running at 400 MHz; Intel® Gemini Lake processors include Intel® GNA running at 200 MHz.
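The conversion above can be sketched as a small standalone function. The constant and function names are hypothetical; the frequencies are the ones quoted in the text, and the cycle count is assumed to be the raw value read from a performance counter.

```cpp
#include <cassert>
#include <cstdint>

// GNA frequencies quoted in the text.
constexpr double kGnaFrequencyHz400 = 400e6;  // Ice Lake, Core i3-8121U
constexpr double kGnaFrequencyHz200 = 200e6;  // Gemini Lake

// seconds = cycles / frequency
double cycles_to_seconds(std::uint64_t cycles, double frequency_hz) {
    assert(frequency_hz > 0.0);
    return static_cast<double>(cycles) / frequency_hz;
}
```

For example, a counter value of 400,000,000 cycles corresponds to one second on a 400 MHz device and to two seconds on a 200 MHz device.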

Performance counters provided for the time being:

Multithreading Support in GNA Plugin

The GNA plugin supports the following configuration parameters for multithreading management:

NOTE: Multithreading mode does not guarantee that computations are performed in the order in which they were issued. Additionally, in this case, software modes do not implement any serialization.

Network Batch Size

The GNA plugin supports the processing of context-windowed speech frames in batches of 1-8 frames in one input blob using InferenceEngine::ICNNNetwork::setBatchSize. Increasing the batch size improves the efficiency of Fully Connected layers only.

NOTE: For networks with Convolutional, LSTM or Memory layers, the only supported batch size is 1.
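The batch size rules above can be captured in a small pre-check before the batch size is set on a network. This is a sketch only: both function names are hypothetical, and whether a network contains Convolutional, LSTM or Memory layers is assumed to be known by the caller.

```cpp
#include <cassert>

// Hypothetical helper: checks a requested batch size against the limits
// stated in the text (1-8 frames, or exactly 1 if the network contains
// Convolutional, LSTM or Memory layers).
bool is_supported_batch_size(int batch_size, bool has_conv_lstm_or_memory) {
    if (has_conv_lstm_or_memory) return batch_size == 1;
    return batch_size >= 1 && batch_size <= 8;
}

// Returns the batch size unchanged; aborts in debug builds if the value
// violates the limits above.
int require_supported_batch_size(int batch_size, bool has_conv_lstm_or_memory) {
    assert(is_supported_batch_size(batch_size, has_conv_lstm_or_memory));
    return batch_size;
}
```

A caller could validate the value with `is_supported_batch_size` and fall back to a batch size of 1 when the check fails.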

Compatibility with Heterogeneous Plugin

The Heterogeneous plugin was tested with GNA as the primary device and CPU as the secondary device. To run inference on networks with layers unsupported by the GNA plugin (for example, Softmax), use the Heterogeneous plugin with the configuration HETERO:GNA,CPU. For the list of supported networks, see Supported Frameworks.

NOTE: Due to a limitation of the GNA backend library, heterogeneous support is limited to cases where, in the resulting sliced graph, only one subgraph is scheduled to run on GNA_HW or GNA_SW devices.

See Also