The primary optimization feature of the toolkit is uniform quantization. In general, this method supports an arbitrary number of bits (at least 2) to represent weights and activations. During quantization, FakeQuantize operations are inserted into the model graph automatically, based on the predefined hardware target, to produce the most hardware-friendly optimized model. After that, different quantization algorithms can tune the FakeQuantize parameters or remove some operations to meet the accuracy criteria. The resulting "fake-quantized" models can be interpreted and transformed into real low-precision models at runtime, which is where the actual performance improvement comes from.
The toolkit provides multiple quantization and auxiliary algorithms that help restore accuracy after weights and activations are quantized. These algorithms can be combined into independent optimization pipelines and applied to different models. However, only the following two quantization algorithms for 8-bit precision are verified and recommended for obtaining stable and reliable results when quantizing DNN models:
Quantization is parametrized by the clamping range and the number of quantization levels:
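Assuming the usual uniform fake-quantization form, with levels denoting the number of quantization levels and s an auxiliary scaling factor (both names are notation assumed here), the operation can be written as:

```latex
output = \frac{\left\lfloor \left( \mathrm{clamp}(input;\ input\_low,\ input\_high) - input\_low \right) \cdot s \right\rceil}{s} + input\_low

\mathrm{clamp}(input;\ input\_low,\ input\_high) = \min(\max(input,\ input\_low),\ input\_high)

s = \frac{levels - 1}{input\_high - input\_low}
```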
Here input_low and input_high represent the quantization range, and ⌊·⌉ denotes rounding to the nearest integer.
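A minimal Python sketch of this operation for a single value, assuming the common clamp, scale, round, rescale form of uniform fake quantization. The function name fake_quantize and the levels argument are illustrative and are not the toolkit's API:

```python
def fake_quantize(x, input_low, input_high, levels=256):
    """Snap x onto a uniform grid of `levels` points spanning [input_low, input_high].

    Hypothetical helper for illustration; assumes input_high > input_low.
    The result is still a float, which is why the operation is called "fake".
    """
    # Number of grid steps per unit of input.
    s = (levels - 1) / (input_high - input_low)
    # Saturate values that fall outside the quantization range.
    clamped = min(max(x, input_low), input_high)
    # Round to the nearest grid index, then map back to the float domain.
    # Note: Python's round() resolves exact .5 ties to the nearest even integer.
    return round((clamped - input_low) * s) / s + input_low
```

With three levels on the range [0, 1], the only representable values are 0, 0.5 and 1, so fake_quantize(0.3, 0.0, 1.0, levels=3) returns 0.5, and out-of-range inputs saturate at the ends of the range.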
The toolkit supports two quantization modes: symmetric and asymmetric. The main difference between them is that in symmetric mode the floating-point zero is mapped directly to the integer zero. In asymmetric mode it can be mapped to any integer, but in both cases the floating-point zero is mapped exactly onto a quantization level (quant), without rounding error.
For the symmetric mode, the formula is parametrized by the scale parameter, which is tuned during the quantization process:
Here level_low and level_high represent the range of the discrete signal.
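One relation consistent with this description, stated here as an assumption rather than as the toolkit's exact definition, ties the clamping range to scale as follows:

```latex
input\_low = scale \cdot \frac{level\_low}{level\_high}, \qquad input\_high = scale
```

Under this assumption, a discrete range that is symmetric around zero (level_low = -level_high) gives a clamping range of [-scale, scale].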
For weights:
For unsigned activations:
For signed activations:
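The three cases above can be sketched in Python as follows. The level bounds used here are the conventional ones for a given bit width (a narrowed signed range for weights, full unsigned and signed integer ranges for activations) and are an assumption, as are the function names and the way input_low is derived from scale:

```python
def symmetric_levels(kind, bits=8):
    """Return (level_low, level_high) for a tensor kind. Hypothetical helper."""
    if kind == "weights":
        # Assumed: the most negative value is dropped so the grid is symmetric around 0.
        return -(2 ** (bits - 1)) + 1, 2 ** (bits - 1) - 1
    if kind == "unsigned_activations":
        return 0, 2 ** bits - 1
    if kind == "signed_activations":
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    raise ValueError(f"unknown tensor kind: {kind!r}")


def symmetric_range(scale, kind, bits=8):
    """Derive (input_low, input_high, levels) from the tunable `scale` (assumed relation)."""
    level_low, level_high = symmetric_levels(kind, bits)
    input_low = scale * level_low / level_high
    input_high = scale
    levels = level_high - level_low + 1
    return input_low, input_high, levels
```

For 8 bits, this sketch yields 255 levels over the discrete range [-127, 127] for weights and 256 levels for both kinds of activations.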
For the asymmetric mode, the quantization formula is parametrized by input_low and input_range, which are tunable parameters:
For weights and activations the following quantization mode is applied:
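A sketch of one range adjustment that keeps the floating-point zero exactly on the integer grid, as the asymmetric mode requires. The zero-point rule below, the relation input_high = input_low + input_range, and the function name adjust_asymmetric_range are assumptions made for illustration and are not taken from the text above:

```python
def adjust_asymmetric_range(input_low, input_range, levels=256):
    """Return (input_low, input_high, zero_point) with 0.0 exactly representable.

    Hypothetical helper; assumes input_range > 0.
    """
    input_high = input_low + input_range
    # Extend the range so that it always contains zero.
    low = min(input_low, 0.0)
    high = max(input_high, 0.0)
    # Integer level closest to where zero falls in the extended range.
    zp = round(-low * (levels - 1) / (high - low))
    if zp == 0 or zp == levels - 1:
        # Zero already sits on an end of the grid; nothing to adjust.
        return low, high, zp
    # Two candidate fixes: move the upper bound or move the lower bound
    # so that zero lands exactly on level zp.
    high_alt = (zp - levels + 1) / zp * low
    low_alt = zp / (zp - levels + 1) * high
    # Keep the wider of the two candidate ranges.
    if high_alt - low > high - low_alt:
        return low, high_alt, zp
    return low_alt, high, zp
```

For input_low = -0.25 and input_range = 1.0, this sketch gives a zero point of 64 and widens the lower bound slightly (to about -0.2513) so that zero falls exactly on level 64.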