The primary optimization feature of the toolkit is uniform quantization. In general, this method supports an arbitrary number of bits (>=2) to represent weights and activations. During the quantization process, so-called FakeQuantize operations are automatically inserted into the model graph based on the predefined hardware target, in order to produce the most hardware-friendly optimized model. After that, different quantization algorithms can tune the FakeQuantize parameters or remove some operations to meet the accuracy criteria. The resulting "fakequantized" models can be interpreted and transformed into real low-precision models at runtime, yielding a real performance improvement.
The toolkit provides multiple quantization and auxiliary algorithms that help restore accuracy after weights and activations are quantized. Algorithms can be combined into independent optimization pipelines that you can apply to quantize a model. However, only the following two quantization algorithms for 8-bit precision are verified and recommended for stable and reliable DNN model quantization:
TunableQuantization enables tuning of quantization hyperparameters. It is a tunable variant of MinMaxQuantization, which is part of the DefaultQuantization pipeline, and is intended for use with a global optimizer that tunes a candidate quantization scheme against predefined accuracy drop and latency improvement criteria. TunableQuantization is usually used as part of a pipeline with auxiliary algorithms. See the TunableQuantization Algorithm documentation.
Quantization is parametrized by clamping range and number of quantization levels:
input_low and input_high represent the quantization range, and ⌊·⌉ denotes rounding to the nearest integer.
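As a rough illustration of quantization parametrized by a clamping range and a number of levels, the sketch below clamps a value, rounds it onto an evenly spaced grid, and maps it back to floating point. The function name `fake_quantize`, the parameter names `input_low` and `levels`, and the scale factor `s = (levels - 1) / (input_high - input_low)` are assumptions made for this example. They follow the usual definition of uniform fake quantization and are not taken from the toolkit's code.

```python
def fake_quantize(x, input_low, input_high, levels=256):
    """Clamp x to [input_low, input_high] and snap it to one of `levels` evenly spaced values."""
    if levels < 2 or input_high <= input_low:
        raise ValueError("need levels >= 2 and input_high > input_low")
    # Assumed scale factor: number of grid steps per unit of the clamping range.
    s = (levels - 1) / (input_high - input_low)
    clamped = min(max(x, input_low), input_high)
    # Python's round() resolves exact ties to the nearest even integer;
    # a real implementation may break ties differently.
    return round((clamped - input_low) * s) / s + input_low


print(fake_quantize(0.3, -1.0, 1.0, levels=5))  # 0.5
print(fake_quantize(5.0, -1.0, 1.0, levels=5))  # 1.0 (clamped to input_high)
```

With 5 levels on [-1, 1] the grid is -1, -0.5, 0, 0.5, 1, so 0.3 lands on 0.5 and any value outside the range is clamped to the nearest bound.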
The toolkit supports two quantization modes: symmetric and asymmetric. The main difference is that in symmetric mode the floating-point zero is mapped directly to integer zero. In asymmetric mode it can be mapped to any integer, but in either case the floating-point zero is mapped exactly to a quantization level (quant), without rounding error.
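To see why the exact mapping of zero matters, the sketch below computes where floating-point zero falls on the quantization grid for a given range. The helper `quantized_zero` and its parameter names are hypothetical and exist only for this illustration. It assumes the same uniform grid with `levels - 1` steps across the range.

```python
def quantized_zero(input_low, input_high, levels):
    """Return (value that 0.0 becomes after quantization, fractional grid index of 0.0)."""
    if not input_low <= 0.0 <= input_high or input_high <= input_low:
        raise ValueError("range must be non-empty and contain zero")
    s = (levels - 1) / (input_high - input_low)
    zero_index = (0.0 - input_low) * s  # integer only if zero lies on the grid
    return round(zero_index) / s + input_low, zero_index


# Range centered on zero with an odd number of levels: zero lies on the grid.
print(quantized_zero(-1.0, 1.0, 255))
# Arbitrary range: zero falls between two levels and picks up a rounding error.
print(quantized_zero(-0.7, 1.3, 256))
```

In the second call the grid index of zero is fractional, so zero is reproduced only approximately. Both modes described above choose the range so that this index is an integer.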
The formula is parametrized by the scale parameter, which is tuned during the quantization process:
Here, level_low and level_high represent the range of the discrete signal.
For unsigned activations:
For signed activations:
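The symmetric parametrization can be sketched as follows. This is an illustration under stated assumptions, not the toolkit's implementation: the relations `input_high = scale` and `input_low = scale * level_low / level_high` are assumed, and so are the integer ranges per tensor kind (the conventional unsigned and signed ranges for activations, plus a narrowed signed range for weights). The names `level_range`, `symmetric_range`, and the `kind` strings are hypothetical.

```python
def level_range(bits, kind):
    """Assumed discrete range (level_low, level_high) for each tensor kind."""
    if kind == "weights":
        # Assumption: narrowed signed range, symmetric around zero.
        return -(2 ** (bits - 1)) + 1, 2 ** (bits - 1) - 1
    if kind == "unsigned_activations":
        return 0, 2 ** bits - 1
    if kind == "signed_activations":
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    raise ValueError(f"unknown kind: {kind}")


def symmetric_range(scale, bits=8, kind="weights"):
    """Derive (input_low, input_high, levels) from a single tunable scale."""
    level_low, level_high = level_range(bits, kind)
    input_low = scale * level_low / level_high
    input_high = scale
    levels = level_high - level_low + 1
    return input_low, input_high, levels


for kind in ("weights", "unsigned_activations", "signed_activations"):
    print(kind, symmetric_range(2.0, bits=8, kind=kind))
```

Under these assumptions only `scale` has to be tuned: the lower bound follows from the ratio of the discrete bounds, which keeps floating-point zero on integer zero.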
The quantization formula is parametrized by input_low and input_range, which are tunable parameters:
For weights and activations the following quantization mode is applied:
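One way to satisfy the requirement that floating-point zero maps to a quantization level without rounding error is sketched below. It assumes `input_high = input_low + input_range`, extends the range to include zero, rounds the zero point to an integer, and then stretches one bound so that zero lands exactly on that integer. The function name `asymmetric_range` and this particular adjustment scheme are assumptions for illustration. The toolkit's exact rule is not reproduced here.

```python
def asymmetric_range(input_low, input_range, levels=256):
    """Return (input_low, input_high, zero_point) with zero snapped onto the grid."""
    if input_range <= 0 or levels < 2:
        raise ValueError("need input_range > 0 and levels >= 2")
    input_high = input_low + input_range
    # Extend the range so that it always contains zero.
    low = min(input_low, 0.0)
    high = max(input_high, 0.0)
    zero_point = round(-low * (levels - 1) / (high - low))
    if zero_point in (0, levels - 1):
        # The zero point rounds to an end of the grid; this sketch keeps the range unchanged.
        return low, high, zero_point
    # Two candidate ranges that place zero exactly on `zero_point`:
    # stretch the upper bound, or stretch the lower bound.
    high_alt = (zero_point - levels + 1) / zero_point * low
    low_alt = zero_point / (zero_point - levels + 1) * high
    # Keep the wider candidate.
    if high_alt - low > high - low_alt:
        return low, high_alt, zero_point
    return low_alt, high, zero_point


low, high, zero_point = asymmetric_range(-0.7, 2.0)
print(low, high, zero_point)
print(-low * 255 / (high - low))  # grid index of zero after the adjustment
```

After the adjustment, the grid index of zero matches the integer `zero_point` up to floating-point precision, whereas the unadjusted range [-0.7, 1.3] would place zero at a fractional index.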