One of the features of Inference Engine is support for quantized networks with different precisions: INT8, INT4, etc. However, it is up to the plugin to define exactly which precisions are supported by the particular HW. All quantized networks that can be expressed in IR have a unified representation by means of the FakeQuantize operation. For more details about low-precision model representation, please refer to this document.
During the model load each plugin can interpret quantization rules expressed in FakeQuantize operations:
Here we provide only a high-level overview of the interpretation rules of FakeQuantize. At runtime, each FakeQuantize can be split into two independent operations: Quantize and Dequantize. The former transforms the input data into the target precision, while the latter transforms the resulting values back to the original range and precision. In practice, Dequantize operations can be propagated forward through linear operations, such as Convolution or Fully-Connected, and in some cases fused with the Quantize operation of the next layer into a so-called Requantize operation (see Fig. 1).
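To illustrate why Dequantize can be moved past a linear operation, here is a minimal sketch. It assumes that Dequantize is an elementwise affine map r = a * q + b and that the linear operation is a plain dot product with floating-point weights (a simplification, since in practice weights are usually quantized as well). The names dequantize, dot, a, b, and w are illustrative and are not part of any OpenVINO API.

```python
def dequantize(q, a, b):
    """Hypothetical affine dequantization applied elementwise: r = a * q + b."""
    return [a * qi + b for qi in q]


def dot(u, v):
    """Plain dot product, standing in for a linear operation such as Fully-Connected."""
    return sum(ui * vi for ui, vi in zip(u, v))


q = [0, 17, 128, 255]        # quantized activations (integer levels)
w = [0.5, -1.0, 0.25, 2.0]   # weights, chosen arbitrarily for the example
a, b = 0.5, -64.0            # illustrative dequantization coefficients

# Order 1: dequantize first, then apply the linear operation.
before = dot(dequantize(q, a, b), w)

# Order 2: apply the linear operation to the quantized values, then dequantize.
# Linearity gives: sum(w_i * (a * q_i + b)) = a * sum(w_i * q_i) + b * sum(w_i).
after = a * dot(q, w) + b * sum(w)

print(before, after)
```

Both orders give the same value, so the accumulation can run on the quantized levels and the remaining affine step can be deferred. That deferred step is what may then be combined with the Quantize operation of the next layer.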
From the calculation standpoint, the FakeQuantize formula is also split into two parts accordingly:
output = round((x - input_low) / (input_high - input_low) * (levels-1)) / (levels-1) * (output_high - output_low) + output_low
The first part of this formula represents the Quantize operation:
q = round((x - input_low) / (input_high - input_low) * (levels-1))
The second is responsible for the dequantization:
r = q / (levels-1) * (output_high - output_low) + output_low
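The split can be checked numerically with a small sketch that transcribes the three formulas above directly. It follows the formulas as written, so it does not handle inputs outside [input_low, input_high]. It also relies on Python's built-in round, which rounds halves to the nearest even integer; whether a given plugin uses the same rounding mode is an assumption. The function names are illustrative.

```python
def quantize(x, input_low, input_high, levels):
    """First part of the formula: map x to an integer level index."""
    return round((x - input_low) / (input_high - input_low) * (levels - 1))


def dequantize(q, output_low, output_high, levels):
    """Second part of the formula: map a level index back to the output range."""
    return q / (levels - 1) * (output_high - output_low) + output_low


def fake_quantize(x, input_low, input_high, output_low, output_high, levels):
    """The full formula, written as a single expression."""
    return (
        round((x - input_low) / (input_high - input_low) * (levels - 1))
        / (levels - 1) * (output_high - output_low) + output_low
    )


# Example parameters, chosen only for illustration.
params = dict(input_low=-1.0, input_high=1.0, levels=256)
x = 0.3

q = quantize(x, **params)
r = dequantize(q, output_low=-1.0, output_high=1.0, levels=256)
full = fake_quantize(x, output_low=-1.0, output_high=1.0, **params)

print(q, r, full)
```

Here q is an integer level in the range 0 to levels-1, and r is equal to the result of the single-expression formula, because the two parts simply compose.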
In scale/zero-point notation, the latter formula can be written as follows:
r = (output_high - output_low) / (levels-1) * (q + output_low / (output_high - output_low) * (levels-1))
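Thus we can define:
Scale: (output_high - output_low) / (levels-1)
Zero-point: -output_low / (output_high - output_low) * (levels-1)
With these definitions, the dequantization formula reads r = scale * (q - zero_point).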
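As a quick sanity check, the sketch below computes the two expressions above, the first being the scale and the second the zero-point, and compares r = scale * (q - zero_point) with the direct dequantization formula. The output range is an arbitrary example chosen so that the zero-point comes out as a whole number. The function name is illustrative.

```python
def scale_and_zero_point(output_low, output_high, levels):
    """Compute scale and zero-point from the FakeQuantize output range."""
    scale = (output_high - output_low) / (levels - 1)
    zero_point = -output_low / (output_high - output_low) * (levels - 1)
    return scale, zero_point


# Example range: 256 levels with a step of 1/128, starting at -1.0.
output_low, output_high, levels = -1.0, 127 / 128, 256
scale, zero_point = scale_and_zero_point(output_low, output_high, levels)

results = []
for q in (0, 100, 128, 255):
    # Direct dequantization formula.
    r_direct = q / (levels - 1) * (output_high - output_low) + output_low
    # Scale/zero-point form of the same computation.
    r_affine = scale * (q - zero_point)
    results.append((q, r_direct, r_affine))

print(scale, zero_point)
```

For this range the scale is 1/128 and the zero-point is 128, up to floating-point rounding, and both forms agree for every level q.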
Note: During the quantization process, the values input_low, input_high, output_low, and output_high are selected so that a floating-point zero maps exactly to an integer value (the zero-point) and vice versa.
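The effect of this choice can be seen by quantizing 0.0 and dequantizing it again. The sketch below assumes the same range on the input and output sides and uses two example ranges: one where the zero-point is a whole number (128) and one where it would be 127.5. The helper names and ranges are illustrative only.

```python
def quantize(x, low, high, levels):
    """Map x to an integer level index."""
    return round((x - low) / (high - low) * (levels - 1))


def dequantize(q, low, high, levels):
    """Map a level index back to the floating-point range."""
    return q / (levels - 1) * (high - low) + low


def zero_round_trip(low, high, levels=256):
    """Quantize 0.0 and dequantize it, with the same range on both sides."""
    return dequantize(quantize(0.0, low, high, levels), low, high, levels)


# Range aligned so that zero falls on a level: zero-point is 128.
aligned = zero_round_trip(-1.0, 127 / 128)

# Symmetric range where zero falls between two levels: zero-point would be 127.5.
misaligned = zero_round_trip(-1.0, 1.0)

print(aligned, misaligned)
```

With the aligned range, zero survives the round trip up to floating-point rounding. With the misaligned range it comes back as a small non-zero value of about half a quantization step, which is the situation the note describes as being avoided.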
In general, OpenVINO can represent and execute quantized models from different sources. However, the Post-training Optimization Toolkit (POT) is considered the default way to get optimized models. Because the POT supports HW-aware quantization, specific rules can be implemented in it for particular HW. However, it is reasonable to remain compatible with general-purpose HW such as CPU and GPU and to support their quantization schemes. These rules are defined below: