Intel® Gaussian & Neural Accelerator is a low-power neural co-processor for continuous inference "at the edge".
Intel® GNA is not intended to replace "classic" inference devices such as CPU, GPU or VPU. It is designed for offloading continuous inference workloads, including (but not limited to) noise reduction and speech recognition, to save power and free CPU resources.
The GNA plugin provides a way to run inference on Intel® GNA, as well as in software execution mode on CPU.
The list of devices with Intel® GNA support includes:
Intel® Speech Enabling Developer Kit
Amazon Alexa* Premium Far-Field Developer Kit
Gemini Lake: Intel® Pentium® Silver J5005 Processor, Intel® Pentium® Silver N5000 Processor, Intel® Celeron® J4005 Processor, Intel® Celeron® J4105 Processor, Intel® Celeron® Processor N4100, Intel® Celeron® Processor N4000
Cannon Lake: Intel® Core™ i3-8121U Processor
Ice Lake: Intel® Core™ i7-1065G7 Processor, Intel® Core™ i7-1060G7 Processor, Intel® Core™ i5-1035G4 Processor, Intel® Core™ i5-1035G7 Processor, Intel® Core™ i5-1035G1 Processor, Intel® Core™ i5-1030G7 Processor, Intel® Core™ i5-1030G4 Processor, Intel® Core™ i3-1005G1 Processor, Intel® Core™ i3-1000G1 Processor, Intel® Core™ i3-1000G4 Processor
NOTE: On platforms where Intel® GNA is not enabled in the BIOS, the driver cannot be installed, so the GNA Plugin uses software emulation mode only.
Intel® GNA hardware requires a driver to be installed in the system.
Intel® GNA driver for Ubuntu Linux 18.04.3 LTS (with HWE Kernel version 5.0+) can be downloaded here.
Intel® GNA driver for Windows is available through Windows Update.
NOTE: For the GNA plugin to function on Linux, redistributable libraries for Intel® C++ compilers must be installed. See the installation instructions here.
Because of the specifics of its hardware architecture, Intel® GNA supports a limited set of layers, their kinds and combinations. For example, you should not expect the GNA Plugin to run computer vision models, except those specifically adapted for the GNA Plugin, because the plugin does not fully support 2D convolutions.
The list of supported layers can be found here (see the GNA column of Supported Layers section). Limitations include:
The Intel® GNA hardware natively supports only 1D convolution.
However, it is possible to map 2D convolutions to 1D when the convolution kernel moves in only one direction. Such a transformation is performed by the GNA Plugin for Kaldi nnet1 convolution. From this perspective, the Intel® GNA hardware convolution operation accepts NHWC input and produces NHWC output. Since OpenVINO™ only supports NCHW layout, it may be necessary to insert Permute layers before/after convolutions.
For example, the Kaldi model optimizer inserts such a permute after convolution for the rm_cnn4a network. This Permute layer is automatically removed by the GNA Plugin since the Intel® GNA hardware convolution layer already produces the required NHWC result.
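To make the layout difference concrete, the following is a minimal sketch of what an NCHW-to-NHWC permute does to a dense buffer. It only illustrates the two layouts; it is not the GNA Plugin's implementation, and the function name NchwToNhwc is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative only: reorders a dense tensor from NCHW to NHWC.
// NchwToNhwc is a hypothetical helper, not part of the Inference Engine API.
std::vector<float> NchwToNhwc(const std::vector<float>& src,
                              std::size_t n, std::size_t c,
                              std::size_t h, std::size_t w) {
    if (src.size() != n * c * h * w) {
        throw std::invalid_argument("buffer size does not match N*C*H*W");
    }
    std::vector<float> dst(src.size());
    for (std::size_t in = 0; in < n; ++in) {
        for (std::size_t ic = 0; ic < c; ++ic) {
            for (std::size_t ih = 0; ih < h; ++ih) {
                for (std::size_t iw = 0; iw < w; ++iw) {
                    // Offset of element (in, ic, ih, iw) in each layout.
                    const std::size_t from = ((in * c + ic) * h + ih) * w + iw;
                    const std::size_t to = ((in * h + ih) * w + iw) * c + ic;
                    dst[to] = src[from];
                }
            }
        }
    }
    return dst;
}
```

In NHWC the channel values of one spatial position are adjacent in memory, whereas in NCHW each channel occupies its own contiguous plane.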
Intel® GNA essentially operates in low-precision mode, which represents a mix of 8-bit (referred to as I8), 16-bit (referred to as I16) and 32-bit (referred to as I32) integer computations. As a result, outputs calculated using reduced integer precision differ from scores calculated using 32-bit floating point (referred to as FP32), for example, on CPU using the Inference Engine CPU Plugin.
Unlike other plugins supporting low-precision execution, the GNA Plugin calculates quantization factors at model loading time, so no calibration is needed to run a model.
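As a rough illustration of what a quantization factor does, the sketch below applies simple symmetric 16-bit quantization: it derives a scale factor from the largest magnitude in the data, quantizes to I16, and converts back to floating point. This is a generic technique shown under stated assumptions, not the algorithm the GNA Plugin actually uses; all function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative only: generic symmetric quantization, NOT the GNA Plugin's algorithm.

// Picks a scale factor that maps the largest magnitude onto the int16 maximum.
float ComputeScaleFactor(const std::vector<float>& values) {
    float maxAbs = 0.0f;
    for (float v : values) {
        maxAbs = std::max(maxAbs, std::fabs(v));
    }
    return maxAbs > 0.0f ? 32767.0f / maxAbs : 1.0f;
}

// Multiplies by the scale factor, rounds, and saturates to the int16 range.
std::vector<std::int16_t> QuantizeI16(const std::vector<float>& values, float scale) {
    std::vector<std::int16_t> quantized;
    quantized.reserve(values.size());
    for (float v : values) {
        float scaled = std::round(v * scale);
        scaled = std::min(32767.0f, std::max(-32768.0f, scaled));
        quantized.push_back(static_cast<std::int16_t>(scaled));
    }
    return quantized;
}

// Recovers an approximate floating-point value from a quantized one.
float Dequantize(std::int16_t q, float scale) {
    return static_cast<float>(q) / scale;
}
```

The round trip through Dequantize returns values close to, but generally not equal to, the originals, which is why integer-precision outputs differ from FP32 scores.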
Mode | Description |
---|---|
GNA_AUTO | Uses Intel® GNA if it is available and software execution mode on CPU otherwise |
GNA_HW | Uses Intel® GNA. Returns error if Intel® GNA is not available |
GNA_SW | Deprecated mode. Executes the GNA-compiled graph on CPU performing calculations in the same precision as the Intel® GNA does, but not in bit-exact mode |
GNA_SW_EXACT | Executes the GNA-compiled graph on CPU performing calculations in the same precision as the Intel® GNA does in bit-exact mode |
GNA_SW_FP32 | Executes the GNA-compiled graph on CPU but substitutes parameters and calculations from low precision to floating point (FP32) |
The plugin supports the configuration parameters listed below. The parameters are passed as std::map<std::string, std::string> on InferenceEngine::Core::LoadNetwork.
Parameter Name | Parameter Values | Default | Description |
---|---|---|---|
GNA_COMPACT_MODE | YES/NO | YES | Reuse I/O buffers to save space (makes debugging harder) |
GNA_SCALE_FACTOR | FP32 number | 1.0 | Scale factor to use for input quantization |
KEY_GNA_DEVICE_MODE | GNA_AUTO/GNA_HW/GNA_SW_EXACT/GNA_SW_FP32 | GNA_AUTO | One of the modes described here |
KEY_GNA_FIRMWARE_MODEL_IMAGE | std::string | "" | Name for embedded model binary dump file |
KEY_GNA_PRECISION | I16/I8 | I16 | Hint to GNA plugin: preferred integer weight resolution for quantization |
KEY_PERF_COUNT | YES/NO | NO | Turn on performance counters reporting |
KEY_GNA_LIB_N_THREADS | 1-127 integer number | 1 | Sets the number of GNA accelerator library worker threads used for inference computation in software modes |
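
The configuration map described above could be assembled roughly as follows. This is a minimal sketch: MakeGnaConfig is a hypothetical helper, and the string keys assume that each KEY_ constant resolves to its name without the KEY_ prefix (for example, KEY_GNA_DEVICE_MODE to "GNA_DEVICE_MODE"). In a real build, prefer the constants from the plugin's configuration header over raw strings.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper that builds a GNA plugin configuration map.
// ASSUMPTION: the string keys are the KEY_ constant names without the
// "KEY_" prefix; verify them against the plugin's configuration header.
std::map<std::string, std::string> MakeGnaConfig(const std::string& deviceMode,
                                                 float scaleFactor,
                                                 int libThreads) {
    // The table above limits the worker thread count to 1-127.
    if (libThreads < 1 || libThreads > 127) {
        throw std::invalid_argument("GNA_LIB_N_THREADS must be in the range 1-127");
    }
    std::map<std::string, std::string> config;
    config["GNA_DEVICE_MODE"] = deviceMode;                     // e.g. "GNA_AUTO"
    config["GNA_SCALE_FACTOR"] = std::to_string(scaleFactor);   // input quantization scale
    config["GNA_PRECISION"] = "I16";                            // table default
    config["GNA_LIB_N_THREADS"] = std::to_string(libThreads);   // software modes only
    return config;
}
```

The resulting map would then be passed to InferenceEngine::Core::LoadNetwork together with the network and the device name GNA.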
By collecting performance counters using InferenceEngine::IInferencePlugin::GetPerformanceCounts, you can find various performance data about execution on GNA. The returned map stores a counter description as the key, and the counter value is stored in the realTime_uSec field of the InferenceEngineProfileInfo structure. The current GNA implementation calculates counters for whole utterance scoring and does not provide "per layer" information. The API reports counter values in cycles; to convert them to seconds, divide by the Intel® GNA clock frequency (seconds = cycles / frequency).
Intel® Ice Lake processors and the Intel® Core™ i3-8121U processor include Intel® GNA running at 400 MHz; Intel® Gemini Lake processors include Intel® GNA running at 200 MHz.
Performance counters provided for the time being:
The GNA plugin supports the following configuration parameters for multithreading management:
KEY_GNA_LIB_N_THREADS
By default, the GNA plugin uses one worker thread for inference computations. This parameter allows you to create up to 127 threads for software modes.
NOTE: Multithreading mode does not guarantee that computations are performed in the order in which they were issued. Additionally, in this case, software modes do not implement any serialization.
The GNA plugin supports the processing of context-windowed speech frames in batches of 1-8 frames in one input blob using InferenceEngine::ICNNNetwork::setBatchSize. Increasing batch size only improves efficiency of Fully Connected layers.
NOTE: For networks with Convolutional, LSTM or Memory layers, the only supported batch size is 1.
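The batching rules above can be summarized in a small helper. This is a sketch only: ChooseGnaBatchSize and the layer type labels it checks are hypothetical and are not taken from the Inference Engine API.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: picks a batch size that respects the limits above.
// The layer type labels are illustrative, not Inference Engine identifiers.
std::size_t ChooseGnaBatchSize(const std::set<std::string>& layerTypes,
                               std::size_t requested) {
    static const std::string kBatchOneOnly[] = {"Convolutional", "LSTM", "Memory"};
    for (const std::string& type : kBatchOneOnly) {
        if (layerTypes.count(type) != 0) {
            return 1;  // these layers support only a batch size of 1
        }
    }
    // Otherwise clamp the requested value to the supported 1-8 range.
    return std::min<std::size_t>(8, std::max<std::size_t>(1, requested));
}
```

The chosen value would then be applied with InferenceEngine::ICNNNetwork::setBatchSize before the network is loaded.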
The Heterogeneous plugin was tested with GNA as the primary device and CPU as the secondary device. To run inference on networks with layers unsupported by the GNA plugin (for example, Softmax), use the Heterogeneous plugin with the following configuration: HETERO:GNA,CPU. For the list of supported networks, see the Supported Frameworks.
NOTE: Due to a limitation of the GNA backend library, heterogeneous support is limited to cases where the resulting sliced graph contains only one subgraph scheduled to run on GNA_HW or GNA_SW devices.