Intel® Gaussian & Neural Accelerator is a low-power neural coprocessor for continuous inference at the edge.
Intel® GNA is not intended to replace classic inference devices such as the CPU, graphics processing unit (GPU), or vision processing unit (VPU). It is designed to offload continuous inference workloads, including but not limited to noise reduction and speech recognition, to save power and free CPU resources.
The GNA plugin provides a way to run inference on Intel® GNA, as well as in the software execution mode on CPU.
Devices with Intel® GNA support:
NOTE: On platforms where Intel® GNA is not enabled in the BIOS, the driver cannot be installed, so the GNA plugin uses the software emulation mode only.
Intel® GNA hardware requires a driver to be installed on the system.
Because of the specifics of its hardware architecture, Intel® GNA supports a limited set of layers, their kinds, and combinations. For example, you should not expect the GNA Plugin to run computer vision models, except those specifically adapted for it, because the plugin does not fully support 2D convolutions.
The list of supported layers can be found here (see the GNA column of the Supported Layers section). Limitations include:
The Intel® GNA hardware natively supports only 1D convolution. However, 2D convolutions can be mapped to 1D when a convolution kernel moves in a single direction. Such a transformation is performed by the GNA Plugin for Kaldi `nnet1` convolution. From this perspective, the Intel® GNA hardware convolution operation accepts `NHWC` input and produces `NHWC` output. Because OpenVINO™ only supports the `NCHW` layout, it may be necessary to insert `Permute` layers before or after convolutions. For example, the Kaldi Model Optimizer inserts such a permute after convolution for the `rm_cnn4a` network. This `Permute` layer is automatically removed by the GNA Plugin, because the Intel® GNA hardware convolution layer already produces the required `NHWC` result.
Intel® GNA essentially operates in low-precision mode, which uses a mix of 8-bit (`I8`), 16-bit (`I16`), and 32-bit (`I32`) integer computations. As a result, outputs calculated in reduced integer precision differ from 32-bit floating-point (`FP32`) results calculated, for example, on a CPU using the Inference Engine CPU Plugin.
Unlike other plugins supporting low-precision execution, the GNA plugin calculates quantization factors at the model loading time, so a model can run without calibration.
| Mode | Description |
|---|---|
| `GNA_AUTO` | Uses Intel® GNA if available, otherwise uses software execution mode on CPU. |
| `GNA_HW` | Uses Intel® GNA if available, otherwise raises an error. |
| `GNA_SW` | *Deprecated.* Executes the GNA-compiled graph on CPU, performing calculations in the same precision as Intel® GNA, but not in bit-exact mode. |
| `GNA_SW_EXACT` | Executes the GNA-compiled graph on CPU, performing calculations in the same precision as Intel® GNA in bit-exact mode. |
| `GNA_SW_FP32` | Executes the GNA-compiled graph on CPU, but substitutes parameters and calculations from low precision to floating point (`FP32`). |
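The fallback behavior of these modes can be illustrated with a small stdlib-only sketch. The mode strings are the real plugin values from the table above, but `is_deprecated_mode` and `resolve_mode` are hypothetical helpers written only for illustration; the assumption that `GNA_AUTO` falls back to `GNA_SW_EXACT` specifically is mine (the document only says "software execution mode on CPU").

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: GNA_SW is the one deprecated device mode.
bool is_deprecated_mode(const std::string& mode) {
    return mode == "GNA_SW";
}

// Hypothetical helper modeling the table above: GNA_AUTO falls back to
// CPU software execution when no Intel GNA hardware is present (which
// software mode it picks is an assumption here), GNA_HW raises an error
// without hardware, and the software modes run on CPU regardless.
std::string resolve_mode(const std::string& mode, bool gna_hw_available) {
    if (mode == "GNA_AUTO")
        return gna_hw_available ? "GNA_HW" : "GNA_SW_EXACT";
    if (mode == "GNA_HW" && !gna_hw_available)
        throw std::runtime_error("GNA hardware is not available");
    return mode;
}
```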
The plugin supports the configuration parameters listed below. The parameters are passed as `std::map<std::string, std::string>` on `InferenceEngine::Core::LoadNetwork` or `InferenceEngine::SetConfig`.
The parameter `KEY_GNA_DEVICE_MODE` can also be changed at run time using `InferenceEngine::ExecutableNetwork::SetConfig` (with any value except `GNA_SW_FP32`). This allows switching the execution between software emulation mode and hardware execution mode after the model is loaded.
The parameter names below correspond to their usage through API keys, such as `GNAConfigParams::KEY_GNA_DEVICE_MODE` or `PluginConfigParams::KEY_PERF_COUNT`. When specifying key values as raw strings (that is, when using the Python API), omit the `KEY_` prefix.
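As a minimal sketch of building such a configuration map with plain stdlib types: the raw-string keys `GNA_DEVICE_MODE` and `PERF_COUNT` correspond to the API constants with the `KEY_` prefix omitted, and in real code the resulting map would be passed to `InferenceEngine::Core::LoadNetwork`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Build a GNA plugin configuration using raw string keys/values.
// With the C++ API constants, the same entries would be written as
// GNAConfigParams::KEY_GNA_DEVICE_MODE and PluginConfigParams::KEY_PERF_COUNT;
// as raw strings, the KEY_ prefix is omitted.
std::map<std::string, std::string> make_gna_config(const std::string& device_mode,
                                                   bool collect_perf_counters) {
    std::map<std::string, std::string> config;
    config["GNA_DEVICE_MODE"] = device_mode;
    config["PERF_COUNT"] = collect_perf_counters ? "YES" : "NO";
    return config;
}
```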
By collecting performance counters using `InferenceEngine::InferRequest::GetPerformanceCounts`, you can obtain various performance data about execution on GNA. The returned map stores a counter description as a key, and the counter value is stored in the `realTime_uSec` field of the `InferenceEngineProfileInfo` structure. The current GNA implementation calculates counters for the whole utterance scoring and does not provide per-layer information. The API allows retrieving counter units in cycles, which can be converted to seconds as follows:
Refer to the table below to learn about the frequency of Intel® GNA inside a particular processor.
| Processor | Frequency of Intel® GNA |
|---|---|
| Intel® Ice Lake processors | 400 MHz |
| Intel® Core™ i3-8121U processor | 400 MHz |
| Intel® Gemini Lake processors | 200 MHz |
The following performance counters are currently provided:
The GNA plugin supports the following configuration parameters for multithreading management:
`KEY_GNA_LIB_N_THREADS`
By default, the GNA plugin uses one worker thread for inference computations. This parameter allows you to create up to 127 threads for software modes.
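A small sketch of validating that thread count before passing it as the raw-string key `GNA_LIB_N_THREADS` (the `KEY_`-less form of the parameter above); `gna_thread_count_value` is a hypothetical helper for illustration, not part of the plugin API.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: clamp a requested worker-thread count to the
// range the GNA plugin accepts for software modes (1..127, default 1)
// and produce the string value for the GNA_LIB_N_THREADS config key.
std::string gna_thread_count_value(int requested) {
    if (requested < 1)   requested = 1;
    if (requested > 127) requested = 127;
    return std::to_string(requested);
}
```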
NOTE: Multithreading mode does not guarantee that computations complete in the same order as requests are issued. Additionally, in this case, the software modes do not implement any serialization.
The Intel® GNA plugin supports the processing of context-windowed speech frames in batches of 1-8 frames in one input blob using `InferenceEngine::ICNNNetwork::setBatchSize`. Increasing the batch size only improves the efficiency of `Fully Connected` layers.
NOTE: For networks with `Convolutional`, `LSTM`, or `Memory` layers, the only supported batch size is 1.
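The batching constraints above can be captured in a short sketch; `is_valid_gna_batch` is a hypothetical helper written for illustration, not a plugin function.

```cpp
#include <cassert>

// Hypothetical helper: check whether a batch size is valid for the GNA
// plugin under the constraints described above: 1-8 frames per input
// blob, and exactly 1 for networks containing Convolutional, LSTM, or
// Memory layers.
bool is_valid_gna_batch(int batch, bool has_conv_lstm_or_memory) {
    if (has_conv_lstm_or_memory)
        return batch == 1;
    return batch >= 1 && batch <= 8;
}
```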
The Heterogeneous plugin was tested with Intel® GNA as the primary device and CPU as the secondary device. To run inference of networks with layers unsupported by the GNA plugin (for example, `Softmax`), use the Heterogeneous plugin with the `HETERO:GNA,CPU` configuration. For the list of supported networks, see Supported Frameworks.
NOTE: Due to limitations of the Intel® GNA backend library, heterogeneous support is limited to cases where, in the resulting sliced graph, only one subgraph is scheduled to run on `GNA_HW` or `GNA_SW` devices.
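The `HETERO:GNA,CPU` device string is an ordered fallback-priority list. A stdlib-only sketch of composing such a string; `make_hetero_device` is a hypothetical helper, as in real code you would simply pass the literal string to `InferenceEngine::Core::LoadNetwork`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: compose a Heterogeneous plugin device string,
// such as "HETERO:GNA,CPU", from a fallback-priority list of devices
// (first entry has the highest priority).
std::string make_hetero_device(const std::vector<std::string>& priorities) {
    std::string device = "HETERO:";
    for (std::size_t i = 0; i < priorities.size(); ++i) {
        if (i != 0) device += ',';
        device += priorities[i];
    }
    return device;
}
```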