Quantization

The primary optimization feature of the toolkit is uniform quantization. In general, this method supports an arbitrary number of bits (>= 2) used to represent weights and activations. During the quantization process, so-called FakeQuantize operations are inserted into the model graph automatically, based on the predefined hardware target, in order to produce the most hardware-friendly optimized model. After that, different quantization algorithms can tune the FakeQuantize parameters or remove some operations in order to meet the accuracy criteria. The resulting "fakequantized" model can be interpreted and transformed into a real low-precision model at runtime, yielding a real performance improvement.

Quantization algorithms

The toolkit provides multiple quantization and auxiliary algorithms which help restore accuracy after quantizing weights and activations. In principle, the algorithms can form independent optimization pipelines that can be applied to quantize a given model. However, only the following two quantization algorithms for 8-bit precision are verified and recommended for use, as they give stable and reliable results for DNN model quantization:

Quantization formula

Quantization is parametrized by clamping range and number of quantization levels:

\[ output = \frac{\left\lfloor (clamp(input; input\_low, input\_high) - input\_low) * s\right\rceil}{s} + input\_low \]

\[ clamp(input; input\_low, input\_high) = min(max(input, input\_low), input\_high) \]

\[ s=\frac{levels-1}{input\_high - input\_low} \]

input_low and input_high represent the quantization range and \(\left\lfloor\cdot\right\rceil\) denotes rounding to the nearest integer.
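
For concreteness, the following is a minimal NumPy sketch of this formula; the function name fake_quantize and the use of NumPy are illustrative assumptions and not part of the toolkit's API.

```python
import numpy as np


def fake_quantize(x, input_low, input_high, levels=256):
    """Map `x` onto `levels` evenly spaced values on [input_low, input_high]."""
    s = (levels - 1) / (input_high - input_low)            # s = (levels-1)/(input_high-input_low)
    clamped = np.clip(x, input_low, input_high)            # clamp(input; input_low, input_high)
    return np.round((clamped - input_low) * s) / s + input_low


# Example: 8-bit quantization on the range [-1, 1]; out-of-range inputs are clamped.
x = np.array([-1.5, -0.33, 0.0, 0.42, 2.0])
print(fake_quantize(x, input_low=-1.0, input_high=1.0))
```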

The toolkit supports two quantization modes: symmetric and asymmetric. The main difference between them is that in the symmetric mode the floating-point zero is mapped directly to the integer zero, whereas in the asymmetric mode it can be mapped to any integer. In both cases, however, the floating-point zero is mapped to a quant exactly, without rounding error.

Symmetric quantization

The formula is parametrized by the scale parameter that is tuned during the quantization process:

\[ input\_low=scale*\frac{level\_low}{level\_high} \]

\[ input\_high=scale \]

Where level_low and level_high represent the range of the discrete signal.

For signed activations:

\[ level\_low=-2^{bits-1} \]

\[ level\_high=2^{bits-1}-1 \]

\[ levels=256 \]
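
As a quick worked example, the snippet below evaluates the symmetric parametrization for 8-bit signed values; the concrete scale value of 0.5 is arbitrary and used only for illustration (in practice the scale is tuned during quantization).

```python
# Symmetric 8-bit parametrization from the formulas above.
bits = 8
level_low = -2 ** (bits - 1)      # -128
level_high = 2 ** (bits - 1) - 1  # 127

scale = 0.5                                   # illustrative value; tuned in practice
input_high = scale                            # 0.5
input_low = scale * level_low / level_high    # ~-0.5039, slightly wider than -scale

print(level_low, level_high, input_low, input_high)
```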

Asymmetric quantization

The quantization formula is parametrized by input_low and input_range, which are tunable parameters:

\[ input\_high=input\_low + input\_range \]

\[ levels=256 \]

For weights and activations, the quantization range is then adjusted as follows so that floating-point zero maps exactly to an integer level, the zero point ZP:

\[ {input\_low}' = min(input\_low, 0) \]

\[ {input\_high}' = max(input\_high, 0) \]

\[ ZP= \left\lfloor \frac{-{input\_low}'*(levels-1)}{{input\_high}'-{input\_low}'} \right \rceil \]

\[ {input\_high}''=\frac{ZP-levels+1}{ZP}*{input\_low}' \]

\[ {input\_low}''=\frac{ZP}{ZP-levels+1}*{input\_high}' \]

\[ (input\_low, input\_high) = \begin{cases} ({input\_low}', {input\_high}'), & ZP \in \{0, levels-1\} \\ ({input\_low}', {input\_high}''), & {input\_high}'' - {input\_low}' > {input\_high}' - {input\_low}'' \\ ({input\_low}'', {input\_high}'), & {input\_high}'' - {input\_low}' \leq {input\_high}' - {input\_low}'' \end{cases} \]
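
The range-adjustment logic above can be summarized in a short plain-Python sketch; the function name adjust_asymmetric_range is an illustrative assumption, not part of the toolkit's API.

```python
def adjust_asymmetric_range(input_low, input_range, levels=256):
    """Adjust the range so that floating-point zero maps exactly to the integer level ZP."""
    input_high = input_low + input_range

    # Make sure zero lies inside the range.
    low = min(input_low, 0.0)
    high = max(input_high, 0.0)

    # Integer level that floating-point zero maps to (ties follow Python's round()).
    zp = round(-low * (levels - 1) / (high - low))

    # Zero already sits on a border of the range: nothing to adjust.
    if zp in (0, levels - 1):
        return low, high

    # Two candidate corrections: recompute the upper or the lower border.
    high_adj = (zp - levels + 1) / zp * low   # keeps `low`, moves `high`
    low_adj = zp / (zp - levels + 1) * high   # keeps `high`, moves `low`

    # Keep the correction that preserves the wider range.
    if high_adj - low > high - low_adj:
        return low, high_adj
    return low_adj, high


# Example: a range whose zero point is not an exact integer initially.
print(adjust_asymmetric_range(input_low=-0.1, input_range=1.0))
```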