Low-precision 8-bit inference is optimized for:
8-bit computations (referred to as int8) offer better performance than inference in higher precision (for example, fp32), because they allow loading more data into a single processor instruction. The usual cost of this significant boost is reduced accuracy. However, it has been shown that the accuracy drop can be negligible and depends on task requirements, so the application engineer can set the maximum accuracy drop that is acceptable.
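The "more data per instruction" argument can be made concrete with a small calculation. The sketch below assumes a 512-bit vector register (the width used by AVX-512); the function name is illustrative only, and real speedups also depend on which instructions the hardware provides.

```python
def lanes_per_register(register_bits, element_bits):
    """Number of elements of a given width that fit in one vector register."""
    return register_bits // element_bits

REGISTER_BITS = 512  # assumption: a 512-bit register, as used by AVX-512

fp32_lanes = lanes_per_register(REGISTER_BITS, 32)
int8_lanes = lanes_per_register(REGISTER_BITS, 8)

print(f"fp32 values per register: {fp32_lanes}")
print(f"int8 values per register: {int8_lanes}")
print(f"ratio: {int8_lanes // fp32_lanes}x")
```

Under this assumption, one register holds four times as many int8 values as fp32 values, which is the upper bound on the per-instruction gain from data width alone.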
For 8-bit integer computations, a model must be quantized. Quantized models can be downloaded from Overview of OpenVINO™ Toolkit Intel's Pre-Trained Models. If the model is not quantized, you can use the Post-Training Optimization Tool to quantize it. The quantization process adds FakeQuantize layers on activations and weights for most layers. Read more about the mathematical computations in Uniform Quantization with Fine-Tuning.
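As a rough illustration of what a FakeQuantize layer computes, the sketch below applies uniform quantization to a single floating-point value. It follows the usual definition of a fake-quantize operation (clamp to an input range, snap to one of `levels` evenly spaced steps, map to an output range). The function and parameter names are assumptions made for this example, and it is not the plugin's implementation.

```python
def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    """Uniform fake quantization of one float value (illustrative sketch)."""
    # Values outside the input range saturate to the output bounds.
    if x <= min(input_low, input_high):
        return output_low
    if x > max(input_low, input_high):
        return output_high
    # Snap x to one of `levels` evenly spaced steps inside the input range...
    step_index = round((x - input_low) / (input_high - input_low) * (levels - 1))
    # ...then map that step back to a float in the output range.
    return step_index / (levels - 1) * (output_high - output_low) + output_low


# With 256 levels on [-1, 1], the output stays within half a step of the input.
step = 2.0 / 255
for value in (-5.0, -0.42, 0.0, 0.3, 5.0):
    print(value, "->", fake_quantize(value, -1.0, 1.0, -1.0, 1.0))
```

The output is still a float, but it can take only `levels` distinct values, which is what later allows the tensor to be represented in 8 bits.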
When you pass the quantized IR to the OpenVINO™ plugin, the plugin automatically recognizes it as a quantized model and performs 8-bit inference. Note that if you pass a quantized model to another plugin that does not support 8-bit inference but supports all operations from the model, the model is inferred in a precision that this plugin supports.
At the runtime stage, the quantized model is loaded to the plugin. The plugin uses the Low Precision Transformation component to update the model to infer it in low precision:
Update FakeQuantize layers to have quantized output tensors in a low precision range and add dequantization layers to compensate for the update. Dequantization layers are pushed through as many layers as possible so that more layers run in low precision. After that, most layers have quantized input tensors in the low precision range and can be inferred in low precision. Ideally, dequantization layers should be fused in the next
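The reason a dequantization step can be pushed through a layer can be sketched with a dot product, which stands in here for a convolution. With one scale per tensor, multiplying by the scales before or after the integer arithmetic gives the same result, so the heavy computation can stay in int8. The scales, values, and helper names below are hypothetical, and symmetric per-tensor quantization is an assumption of this sketch.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers using a single scale (symmetric)."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in q_values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

activations = [0.5, -1.25, 2.0, 0.75]
weights = [0.1, 0.4, -0.3, 0.2]
act_scale = 2.0 / 127  # hypothetical per-tensor scales
w_scale = 0.4 / 127

q_act = quantize(activations, act_scale)
q_w = quantize(weights, w_scale)

# Variant A: dequantize first, then compute in floating point.
result_dequant_first = dot(dequantize(q_act, act_scale), dequantize(q_w, w_scale))

# Variant B: integer dot product, then one dequantization multiply afterwards.
int_accumulator = dot(q_act, q_w)  # pure integer arithmetic
result_dequant_last = int_accumulator * (act_scale * w_scale)

print(result_dequant_first, result_dequant_last, dot(activations, weights))
```

Both variants agree up to floating-point rounding and stay close to the unquantized result, which is why moving the dequantization after the layer does not change what the model computes.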
The simplest way to infer the model and collect performance counters is the C++ Benchmark Application.
If you infer the model with the Inference Engine CPU plugin and collect performance counters, all operations (except the last non-quantized SoftMax) are executed in INT8 precision.
Information about layer precision is stored in the performance counters that are available from the Inference Engine API. For example, part of the performance counters table for inference of the quantized TensorFlow* implementation of the ResNet-50 model on the CPU plugin looks as follows:
| layerName | execStatus | layerType | execType | realTime (ms) | cpuTime (ms) |
The execStatus column of the table includes the following possible values:
EXECUTED - the layer was executed by a standalone primitive,
NOT_RUN - the layer was not executed by a standalone primitive, or was fused with another operation and executed in another layer's primitive.
The execType column of the table includes inference primitives with specific suffixes. The layers have the following marks:
I8 for layers that had 8-bit data type input and were computed in 8-bit precision
FP32 for layers computed in 32-bit precision
Convolution layers are executed in int8 precision. The remaining layers are fused into Convolutions using the post-operations optimization technique, which is described in Internal CPU Plugin Optimizations.
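To check how much of a model actually ran in 8-bit precision, the counters can be summarized by the execType suffix. The sketch below works on rows in the column order of the table above. The sample rows (layer names, execType strings, and timings) are made up for illustration and are not real measurements, and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical rows: (layerName, execStatus, layerType, execType, realTime, cpuTime).
SAMPLE_ROWS = [
    ("conv1", "EXECUTED", "Convolution", "jit_avx512_I8", 0.377, 0.377),
    ("conv1_relu", "NOT_RUN", "ReLU", "undef", 0.0, 0.0),
    ("res2a_branch1", "EXECUTED", "Convolution", "jit_avx512_1x1_I8", 0.499, 0.499),
    ("prob", "EXECUTED", "SoftMax", "jit_avx512_FP32", 0.041, 0.041),
]

def precision_of(exec_type):
    """Return the precision suffix of an execType string, or 'other'."""
    for suffix in ("I8", "FP32"):
        if exec_type.endswith("_" + suffix):
            return suffix
    return "other"

def summarize(rows):
    """Count executed layers per precision; NOT_RUN rows are skipped."""
    counts = Counter()
    for _name, exec_status, _layer_type, exec_type, _real_ms, _cpu_ms in rows:
        if exec_status == "EXECUTED":
            counts[precision_of(exec_type)] += 1
    return dict(counts)

summary = summarize(SAMPLE_ROWS)
print(summary)
```

Rows marked NOT_RUN are skipped because fused layers run inside another layer's primitive and would otherwise be counted twice.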