AccuracyAwareQuantization Algorithm

The AccuracyAware algorithm performs 8-bit quantization while keeping the accuracy drop within a predefined range, for example 1%. It may yield lower performance than the DefaultQuantization algorithm, because some layers can be reverted to their original precision. In general, the algorithm consists of the following steps:

1. The model is fully quantized using the DefaultQuantization algorithm.
2. The quantized and full-precision models are compared on a subset of the validation set to find mismatches in the target accuracy metric. A ranking subset is extracted based on these mismatches.
3. Optionally, if the accuracy criteria cannot be satisfied with fully symmetric quantization, the quantized model is converted to mixed mode, and step 2 is repeated.
4. A layer-wise ranking is performed to estimate the contribution of each quantized layer to the accuracy drop.
5. Based on the ranking, the most "problematic" layer is reverted to the original precision. The resulting model is then evaluated on the full validation set to obtain the new accuracy drop.
6. If the accuracy criteria are satisfied for all predefined accuracy metrics, the algorithm finishes. Otherwise, it continues by reverting the next "problematic" layer.
7. Reverting a layer may fail to improve the accuracy or may even worsen it. In that case, re-ranking is triggered as described in step 4.
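
The control flow of the steps above can be illustrated with a minimal Python sketch. It is not the tool's implementation: `accuracy_aware_loop` and the `evaluate` callback are hypothetical names introduced here, a single callback stands in for both the ranking subset and the full validation set, the optional mixed-mode conversion is omitted, and a higher metric value is assumed to be better.

```python
from typing import Callable, List, Set


def accuracy_aware_loop(
    layers: List[str],
    evaluate: Callable[[Set[str]], float],
    baseline: float,
    maximal_drop: float,
    max_iter_num: int,
) -> Set[str]:
    """Return the layers that remain quantized when the loop stops.

    `evaluate(quantized)` is a hypothetical callback returning the metric of a
    model in which only the layers in `quantized` are quantized.
    """
    quantized = set(layers)  # start from a fully quantized model
    drop = baseline - evaluate(quantized)
    ranking: List[str] = []

    for _ in range(max_iter_num):
        if drop <= maximal_drop or not quantized:
            break  # accuracy criterion satisfied, or nothing left to revert
        if not ranking:
            # Rank layers by the metric obtained when each one alone is reverted:
            # the layer whose reversal helps most comes first.
            ranking = sorted(
                quantized,
                key=lambda name: evaluate(quantized - {name}),
                reverse=True,
            )
        worst = ranking.pop(0)
        quantized.discard(worst)  # revert the most "problematic" layer
        new_drop = baseline - evaluate(quantized)
        if new_drop >= drop:
            ranking = []  # no improvement: force a re-ranking on the next pass
        drop = new_drop
    return quantized


# Toy model (made-up numbers): each quantized layer costs a fixed amount of accuracy.
damage = {"conv1": 0.001, "conv2": 0.02, "fc": 0.004}
kept = accuracy_aware_loop(
    layers=list(damage),
    evaluate=lambda q: 0.72 - sum(damage[name] for name in q),
    baseline=0.72,
    maximal_drop=0.01,
    max_iter_num=30,
)
print(sorted(kept))  # ['conv1', 'fc']
```

In the toy run, reverting only `conv2` brings the drop from about 0.025 down to about 0.005, so the loop stops after one iteration and the other two layers stay quantized.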

Because the DefaultQuantization algorithm is used for initialization, all of its parameters are also valid and can be specified. Only the AccuracyAware-specific parameters are described here:

"name": "AccuracyAwareQuantization", // Compression algorithm name
"params": {
    "ranking_subset_size": 300, // Size of the subset used to rank layers by their contribution to the accuracy drop
    "max_iter_num": 30, // Maximum number of algorithm iterations (maximum number of layers that may be reverted to full precision)
    "maximal_drop": 0.005, // Maximum accuracy drop allowed after quantization
    "drop_type": "absolute", // Drop type of the accuracy metric: relative or absolute (default)
    "use_prev_if_drop_increase": false, // Whether to use the network snapshot from the previous iteration if the drop increases
    "base_algorithm": "DefaultQuantization", // Base algorithm used to quantize the model at the beginning
    "convert_to_mixed_preset": false, // Whether to convert the model to mixed mode if the accuracy criteria
                                      // of the symmetrically quantized model are not satisfied
    "metrics": [ // Optional list of metrics taken into account during optimization.
                 // If not specified, all metrics defined in the engine config are used
        {
            "name": "accuracy", // Metric name to optimize
            "baseline_value": 0.72 // Baseline metric value of the original model
        }
    ],
    "metric_subset_ratio": 0.5 // Part of the validation set used to compare the full-precision and
                               // quantized models element-wise when the original model's metric values are predefined
}
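
To make the interaction between "maximal_drop" and "drop_type" concrete, here is a small Python sketch. The helper `accuracy_drop` is hypothetical and not part of the tool. It assumes that a higher metric value is better and that the relative drop is the absolute drop divided by the baseline value; the tool's exact formula may differ.

```python
def accuracy_drop(baseline: float, quantized: float, drop_type: str = "absolute") -> float:
    """Compute the accuracy drop of a quantized model (assumed definitions)."""
    if drop_type == "absolute":
        return baseline - quantized
    if drop_type == "relative":
        # Assumption: the relative drop is measured as a fraction of the baseline.
        return (baseline - quantized) / baseline
    raise ValueError(f"unknown drop_type: {drop_type!r}")


baseline_value = 0.72  # same baseline as in the config above
quantized_value = 0.716  # made-up metric of a quantized model
maximal_drop = 0.005

# Absolute drop is about 0.004, so the criterion is met.
absolute_ok = accuracy_drop(baseline_value, quantized_value, "absolute") <= maximal_drop
# Relative drop is about 0.0056, so the same model fails the criterion.
relative_ok = accuracy_drop(baseline_value, quantized_value, "relative") <= maximal_drop
```

Under these assumptions the same "maximal_drop" value is stricter in relative mode whenever the baseline metric is below 1.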