Quantization

The primary optimization feature of the toolkit is uniform quantization. In general, this method supports an arbitrary number of bits (at least two) to represent weights and activations. During the quantization process, `FakeQuantize` operations are inserted into the model graph automatically, based on the predefined hardware target, in order to produce the most hardware-friendly optimized model. After that, different quantization algorithms can tune the `FakeQuantize` parameters or remove some operations in order to meet the accuracy criteria. The resulting "fakequantized" models can be interpreted and transformed into real low-precision models at runtime, which is where the actual performance improvement is obtained.

The toolkit provides multiple quantization and auxiliary algorithms that help restore accuracy after weights and activations are quantized. Algorithms can be combined into independent optimization pipelines that you apply to quantize a model. However, only the following two quantization algorithms for 8-bit precision are verified and recommended for stable, reliable results when quantizing DNN models:

- **DefaultQuantization** is the default method. It gives fast and, in most cases, accurate results for 8-bit quantization. For details, see the DefaultQuantization Algorithm documentation.
- **AccuracyAwareQuantization** keeps the accuracy drop after quantization within a predefined range, at the cost of some of the performance improvement. It may require more time for quantization. For details, see the AccuracyAwareQuantization Algorithm documentation.

**TunableQuantization** enables tuning of quantization hyperparameters. It is a tunable variant of **MinMaxQuantization**, which is part of the **DefaultQuantization** pipeline, and is provided for use with a global optimizer that tunes a candidate quantization scheme against predefined accuracy-drop and latency-improvement criteria. TunableQuantization is usually used as part of a pipeline with auxiliary algorithms. For details, see the TunableQuantization Algorithm documentation.
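As a rough illustration of how the two recommended algorithms might be selected, the sketch below writes each one as a name-plus-parameters dictionary. The dictionary layout and the parameter names (`preset`, `stat_subset_size`, `maximal_drop`) are assumptions recalled from typical toolkit configurations, and the values are placeholders; check the algorithm documentation before relying on them.

```python
# Assumed configuration layout: one dict per algorithm, with a "name" and "params".
# Parameter names and values are illustrative and should be verified against
# the DefaultQuantization / AccuracyAwareQuantization documentation.
default_quantization = {
    "name": "DefaultQuantization",
    "params": {
        "preset": "performance",   # assumed preset name
        "stat_subset_size": 300,   # samples used to collect statistics
    },
}

accuracy_aware_quantization = {
    "name": "AccuracyAwareQuantization",
    "params": {
        "preset": "performance",
        "stat_subset_size": 300,
        "maximal_drop": 0.01,      # tolerated accuracy drop (placeholder value)
    },
}

# A pipeline is then a list of such entries; start with the default method.
algorithms = [default_quantization]
```

Starting with DefaultQuantization and switching to AccuracyAwareQuantization only when the accuracy drop is too large follows the trade-off described above: the second algorithm takes longer and may give up part of the speedup.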

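Quantization is parametrized by the clamping range and the number of quantization levels:

$$
\text{output} = \frac{\left\lfloor \left(\operatorname{clamp}(\text{input};\ \text{input\_low},\ \text{input\_high}) - \text{input\_low}\right) \cdot s \right\rceil}{s} + \text{input\_low}
$$

$$
\operatorname{clamp}(\text{input};\ \text{input\_low},\ \text{input\_high}) = \min(\max(\text{input},\ \text{input\_low}),\ \text{input\_high})
$$

$$
s = \frac{\text{levels} - 1}{\text{input\_high} - \text{input\_low}}
$$

Here `input_low` and `input_high` represent the quantization range, `levels` is the number of quantization levels, and $\left\lfloor \cdot \right\rceil$ denotes rounding to the nearest integer.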
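A minimal pure-Python sketch of this uniform quantize-then-dequantize step, assuming the conventional form: clamp the input to `[input_low, input_high]`, scale by `s = (levels - 1) / (input_high - input_low)`, round to the nearest integer, and map back to floating point. The function name `fake_quantize` and the default `levels=256` are illustrative choices, not part of the toolkit API.

```python
def fake_quantize(x, input_low, input_high, levels=256):
    """Snap x onto a uniform grid of `levels` points spanning [input_low, input_high]."""
    if levels < 2:
        raise ValueError("levels must be at least 2")
    if input_high <= input_low:
        raise ValueError("input_high must be greater than input_low")

    s = (levels - 1) / (input_high - input_low)      # grid steps per unit of input
    clamped = min(max(x, input_low), input_high)     # clamp to the quantization range
    # Note: Python's round() sends exact ties to the nearest even integer.
    return round((clamped - input_low) * s) / s + input_low


# Three levels over [-1, 1] give the grid {-1, 0, 1}.
print(fake_quantize(0.3, -1.0, 1.0, levels=3))   # -> 0.0
print(fake_quantize(0.7, -1.0, 1.0, levels=3))   # -> 1.0
print(fake_quantize(5.0, -1.0, 1.0, levels=3))   # -> 1.0 (clamped)
```

The output stays in floating point, which is why the operation is called "fake" quantization: the values are restricted to the low-precision grid, but the tensor type does not change until the runtime converts the model.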

The toolkit supports two quantization modes: symmetric and asymmetric. The main difference between them is that in symmetric mode the floating-point zero is mapped directly to the integer zero. In asymmetric mode it can be mapped to any integer, but in either case the floating-point zero lands exactly on a quantization level, without rounding error.

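In symmetric mode, the formula is parametrized by the `scale` parameter, which is tuned during the quantization process:

$$
\text{input\_low} = \text{scale} \cdot \frac{\text{level\_low}}{\text{level\_high}}
$$

$$
\text{input\_high} = \text{scale}
$$

where `level_low` and `level_high` represent the range of the discrete signal.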

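- For weights: $\text{level\_low} = -2^{\text{bits}-1} + 1$ and $\text{level\_high} = 2^{\text{bits}-1} - 1$, which gives $\text{levels} = 255$ at 8 bits.
- For unsigned activations: $\text{level\_low} = 0$ and $\text{level\_high} = 2^{\text{bits}} - 1$, which gives $\text{levels} = 256$ at 8 bits.
- For signed activations: $\text{level\_low} = -2^{\text{bits}-1}$ and $\text{level\_high} = 2^{\text{bits}-1} - 1$, which gives $\text{levels} = 256$ at 8 bits.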
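The symmetric parametrization can be sketched as follows. The integer ranges used here are assumptions based on the usual conventions: a narrow signed range for weights (the most negative code is dropped so the grid is symmetric around zero), a full unsigned range for unsigned activations, and a full signed range for signed activations. The names `level_range` and `symmetric_range` are hypothetical helpers introduced for this example.

```python
def level_range(bits, kind):
    """Return (level_low, level_high) for the given tensor kind (assumed conventions)."""
    if kind == "weights":
        return -(2 ** (bits - 1)) + 1, 2 ** (bits - 1) - 1
    if kind == "unsigned_activations":
        return 0, 2 ** bits - 1
    if kind == "signed_activations":
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    raise ValueError(f"unknown tensor kind: {kind}")


def symmetric_range(scale, bits=8, kind="weights"):
    """Derive (input_low, input_high, levels) from a single tunable scale."""
    level_low, level_high = level_range(bits, kind)
    input_low = scale * level_low / level_high
    input_high = scale
    levels = level_high - level_low + 1
    return input_low, input_high, levels


print(symmetric_range(2.0, 8, "weights"))               # -> (-2.0, 2.0, 255)
print(symmetric_range(2.0, 8, "unsigned_activations"))  # -> (0.0, 2.0, 256)
```

Because `input_low` is a multiple of the grid step in all three cases, floating-point zero always falls on a quantization level, which is the defining property of the symmetric mode.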

In asymmetric mode, the quantization formula is parametrized by `input_low` and `input_range`, which are tunable parameters:

$$
\text{input\_high} = \text{input\_low} + \text{input\_range}
$$

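For both weights and activations, the range is then adjusted so that the floating-point zero falls exactly on a quantization level:

$$
\text{input\_low}' = \min(\text{input\_low},\ 0)
$$

$$
\text{input\_high}' = \max(\text{input\_high},\ 0)
$$

$$
ZP = \left\lfloor \frac{-\text{input\_low}' \cdot (\text{levels} - 1)}{\text{input\_high}' - \text{input\_low}'} \right\rceil
$$

$$
\text{input\_high}'' = \frac{ZP - \text{levels} + 1}{ZP} \cdot \text{input\_low}'
$$

$$
\text{input\_low}'' = \frac{ZP}{ZP - \text{levels} + 1} \cdot \text{input\_high}'
$$

$$
(\text{input\_low},\ \text{input\_high}) =
\begin{cases}
(\text{input\_low}',\ \text{input\_high}'), & ZP \in \{0,\ \text{levels} - 1\} \\
(\text{input\_low}',\ \text{input\_high}''), & \text{input\_high}'' - \text{input\_low}' > \text{input\_high}' - \text{input\_low}'' \\
(\text{input\_low}'',\ \text{input\_high}'), & \text{input\_high}'' - \text{input\_low}' \le \text{input\_high}' - \text{input\_low}''
\end{cases}
$$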
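One way to guarantee that zero lands exactly on the grid in asymmetric mode is a zero-point alignment step, sketched below under stated assumptions: the range is first extended to include zero, an integer zero point `zp` is computed by rounding, and then one end of the range is stretched so that `zp` becomes exact, keeping whichever candidate range is wider. The helper name `align_zero` and the small `levels=4` example are illustrative only.

```python
def align_zero(input_low, input_high, levels=256):
    """Adjust [input_low, input_high] so that 0.0 falls exactly on a quantization level."""
    low = min(input_low, 0.0)     # make sure the range contains zero
    high = max(input_high, 0.0)
    if high == low:
        raise ValueError("quantization range is empty")

    # Integer index of the level closest to zero.
    zp = round(-low * (levels - 1) / (high - low))
    if zp in (0, levels - 1):
        return low, high          # zero already sits on an end of the range

    high_stretched = (zp - levels + 1) / zp * low    # keep low, move high
    low_stretched = zp / (zp - levels + 1) * high    # keep high, move low
    if high_stretched - low > high - low_stretched:
        return low, high_stretched
    return low_stretched, high


input_low, input_range = -0.3, 1.3
low, high = align_zero(input_low, input_low + input_range, levels=4)
print(low, high)   # -> -0.5 1.0, i.e. the grid {-0.5, 0.0, 0.5, 1.0}
```

Without this step, the original range `[-0.3, 1.0]` with four levels would place its grid points at roughly -0.3, 0.13, 0.57, and 1.0, so an exact zero in the input (for example from padding or ReLU) would be quantized with a rounding error.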