# Quantization

This document outlines the design of the MLIR quantization system. While the
term "quantization" is highly overloaded, in this case it refers to a fairly
narrow scope of techniques used to enable the conversion of floating-point
computations to corresponding and plausible variants expressed in integer math
for inference, as has historically been supported by low-bit-depth inference
engines such as TFLite, various accelerator hardware, and many DSPs.

Much of this is inspired by the approach taken
[in this paper](https://arxiv.org/abs/1712.05877), with many extensions and
adaptations folded in. It specifically documents the positions that MLIR has
taken on the topic, and is not a general reference.

[TOC]

## Uniform quantization

The primary quantization mechanism supported by MLIR is a scheme which can
express fixed point and affine transformations via uniformly spaced points on
the [Real](https://en.wikipedia.org/wiki/Real_number) number line.

Further, the scheme can be applied:

*   *per-layer* : Applying to every value within the target type.
*   *per-axis* (also called *per-channel*) : Applying individually to each
    index along a specific axis of a tensor type.

### Fixed point values

[Fixed point](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) values are
a [Real](https://en.wikipedia.org/wiki/Real_number) number divided by a
*scale*. We will call the result of this division the *scaled value*.

$$ real\_value = scaled\_value * scale $$

The scale can be interpreted as the distance, in real units, between
neighboring scaled values. For example, if the scale is $ \pi $, then fixed
point values with this scale can only represent multiples of $ \pi $, and
nothing in between. The maximum rounding error when converting an arbitrary
Real to a fixed point value with a given $ scale $ is $ \frac{scale}{2} $.
Continuing the previous example, when $ scale = \pi $, the maximum rounding
error will be $ \frac{\pi}{2} $.

Multiplication can be performed on scaled values with different scales, using
the same algorithm as multiplication of real values (note that the product's
scaled value has $ scale_{product} = scale_{left \mbox{ } operand} *
scale_{right \mbox{ } operand} $). Addition can be performed on scaled values,
so long as they have the same scale, using the same algorithm as addition of
real values. This makes it convenient to represent scaled values on a computer
as signed integers, and to perform arithmetic on those signed integers, because
the results will be correct scaled values.
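These two rules are easy to check in code. The following is a minimal Python
sketch (illustrative only, not part of MLIR; the `FixedPoint` helper is
hypothetical) demonstrating that multiplication combines scales while addition
requires the scales to match:

```python
from dataclasses import dataclass

@dataclass
class FixedPoint:
    scaled_value: int  # stored as a signed integer
    scale: float       # real distance between neighboring scaled values

    def to_real(self) -> float:
        return self.scaled_value * self.scale

def fixed_mul(a: FixedPoint, b: FixedPoint) -> FixedPoint:
    # Multiplication works across scales; the product's scale is the
    # product of the operand scales.
    return FixedPoint(a.scaled_value * b.scaled_value, a.scale * b.scale)

def fixed_add(a: FixedPoint, b: FixedPoint) -> FixedPoint:
    # Addition is only defined for operands that share the same scale.
    assert a.scale == b.scale, "operands must have the same scale"
    return FixedPoint(a.scaled_value + b.scaled_value, a.scale)

# 0.75 and 0.5 expressed with scale 0.25 are the scaled values 3 and 2.
x = FixedPoint(3, 0.25)
y = FixedPoint(2, 0.25)
assert fixed_add(x, y).to_real() == 1.25   # 0.75 + 0.5
assert fixed_mul(x, y).to_real() == 0.375  # 0.75 * 0.5, at scale 0.0625
```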
### Affine values

Mathematically speaking, affine values are the result of
[adding a Real-valued *zero point* to a scaled value](https://en.wikipedia.org/wiki/Affine_transformation#Representation).
Alternatively (and equivalently), subtracting a zero point from an affine value
results in a scaled value:

$$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$

Essentially, affine values are a shift of the scaled values by some constant
amount. Arithmetic (i.e., addition, subtraction, multiplication, division)
cannot, in general, be directly performed on affine values; they must first be
[converted](#affine-to-fixed-point) to the equivalent scaled values.

As alluded to above, the motivation for using affine values is to more
efficiently represent the real values that will actually be encountered during
computation. Frequently, the real values that will be encountered are not
symmetric around the real zero. We also make the assumption that the real zero
is encountered during computation, and should thus be represented.

In this case, it is inefficient to store scaled values as signed integers,
because some of the signed integers will never be used; in effect, the bit
patterns corresponding to those signed integers go to waste.

In order to exactly represent the real zero with an integral-valued affine
value, the zero point must be an integer between the minimum and maximum affine
value (inclusive). For example, given an affine value represented by an 8-bit
unsigned integer, we have: $ 0 \leq zero\_point \leq 255 $. This is important
because, in convolution-like operations of deep neural networks, we frequently
need to zero-pad inputs and outputs, so zero must be exactly representable, or
the result will be biased.

### Relation

Real values, fixed point values, and affine values relate through the
following equation, which demonstrates how to convert one type of number to
another:

$$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$

Note that computers generally store mathematical values using a finite number
of bits. Thus, while the above conversions are exact, in order to store the
result in a finite number of bits we must, in general, round the result of the
conversion (this applies both to storing using floating point and to storing
using fixed point). Note that a full discussion of rounding behavior is
outside the scope of this document; unless otherwise stated, it is safe to
assume that rounding is performed according to the IEEE 754 default of
round-to-nearest-even (RNE), where hardware permits.

### Converting between real and fixed point or affine

To convert a real value to a fixed point value, we must know the scale. To
convert a real value to an affine value, we must know the scale and the zero
point.

#### Real to affine

To convert an input tensor of real-valued elements (usually represented by a
floating point format, frequently
[Single precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format))
to a tensor of affine elements represented by an integral type (e.g. 8-bit
unsigned integer), the following conversion can be performed (note that it is
not required that all representable values of the integral type are used):

$$
affine\_value =
  clampToTargetSize(roundToNearestInteger(\frac{real\_value}{scale}) + zero\_point)
$$

where we assume the following types:

-   `real_value`: Single
-   `scale`: Single
-   `roundToNearestInteger`: returns a 32-bit integer
-   `zero_point`: 8-bit or 16-bit integer
-   `affine_value`: 8-bit or 16-bit integer

Note that the bit depths above are indicative of common types on typical
hardware; the scheme is not constrained to particular bit depths, nor does it
require that the entire range of an N-bit integer be used.
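As a concrete illustration, here is a minimal NumPy sketch of this conversion
(illustrative only; the `quantize` helper is hypothetical and not an MLIR or
TFLite API). Note that `np.round` rounds halfway cases to even, matching the
RNE convention mentioned above:

```python
import numpy as np

def quantize(real_value, scale, zero_point, dtype=np.uint8):
    """Real -> affine, following the formula above."""
    info = np.iinfo(dtype)
    # roundToNearestInteger(real_value / scale), then add the zero point
    # in 32-bit integer arithmetic.
    rounded = np.round(np.asarray(real_value, dtype=np.float32) / scale)
    affine = rounded.astype(np.int32) + zero_point
    # clampToTargetSize: clamp to the representable range of the target type.
    return np.clip(affine, info.min, info.max).astype(dtype)

# With scale = 0.1 and zero_point = 128, the real range [-12.8, 12.7] maps
# onto uint8 [0, 255], and the real zero maps exactly to the zero point.
assert quantize(0.0, 0.1, 128) == 128
assert quantize(1.0, 0.1, 128) == 138
assert quantize(100.0, 0.1, 128) == 255  # out of range: clamped
```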
#### Affine to real

To convert an output tensor of affine elements represented by uint8 or uint16
to a tensor of real-valued elements (usually represented with a floating point
format, frequently Single precision), the following conversion can be
performed:

$$
real\_value = roundToNearestFloat(affine\_value - zero\_point) * scale
$$

where we assume the following types:

-   `real_value`: Single
-   `scale`: Single
-   `affine_value`: 8-bit or 16-bit integer
-   `zero_point`: 8-bit or 16-bit integer
-   `roundToNearestFloat`: returns a Single
-   `-` (subtraction): returns a 32-bit signed integer

#### Affine to fixed point

When the affine and fixed point scales are the same, subtract the zero point
from the affine value to get the equivalent fixed point value.

$$
scaled\_value = affine\_value_{non\mbox{-}negative} - zero\_point_{non\mbox{-}negative}
$$

#### Fixed point to affine

When the affine and fixed point scales are the same, add the zero point to the
fixed point value to get the equivalent affine value.

$$
affine\_value_{non\mbox{-}negative} = scaled\_value + zero\_point_{non\mbox{-}negative}
$$
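The three conversions above can be sketched in NumPy as follows (illustrative
only; the helper names are hypothetical):

```python
import numpy as np

def dequantize(affine_value, scale, zero_point):
    # Affine to real: subtract the zero point in 32-bit signed arithmetic,
    # then scale the result.
    scaled = affine_value.astype(np.int32) - np.int32(zero_point)
    return scaled.astype(np.float32) * np.float32(scale)

def affine_to_fixed(affine_value, zero_point):
    # Valid only when the affine and fixed point scales are the same.
    return affine_value.astype(np.int32) - np.int32(zero_point)

def fixed_to_affine(scaled_value, zero_point):
    return scaled_value + np.int32(zero_point)

q = np.array([0, 128, 255], dtype=np.uint8)  # affine values
assert np.allclose(dequantize(q, 0.1, 128), [-12.8, 0.0, 12.7])
assert np.array_equal(affine_to_fixed(q, 128), [-128, 0, 127])
assert np.array_equal(fixed_to_affine(affine_to_fixed(q, 128), 128), q)
```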
## Usage within MLIR

There are several components to the quantization system being developed within
MLIR:

*   *Quantization* dialect containing:

    *   A family of [QuantizedTypes](#quantized-type) which represent the
        mapping between *expressed* values (typically of a floating point
        computer type) and *storage* values (typically of an integral computer
        type).
    *   [Type conversion ops](#quantized-type-conversion-operations) for
        converting between types based on a QuantizedType and its *expressed*
        and *storage* sub-types.
    *   [Instrumentation ops](#instrumentation-and-constraint-operations) for
        assigning instrumentation points within the computation where runtime
        statistics may help guide the quantization process.

*   [Integration with simulated quantization at training time](#integration-with-simulated-quantization-at-training-time)

*   [TFLite native quantization](#tflite-native-quantization)

    *   The TFLite op-set natively supports uniform-quantized variants.
    *   Passes and tools exist to convert directly from the *TensorFlow*
        dialect to the TFLite quantized operation set.

Not every application of quantization will use all of these facilities.
Specifically, the TensorFlow to TensorFlow Lite conversion uses the
QuantizedTypes but has its own operations for type conversion and expression
of the supporting math.

## Quantization Dialect

### Quantized type

TODO: Flesh this section out.

*   QuantizedType base class
*   UniformQuantizedType

### Quantized type conversion operations

*   qcast : Convert from an expressed type to QuantizedType
*   dcast : Convert from a QuantizedType to its expressed type
*   scast : Convert between a QuantizedType and its storage type

### Instrumentation and constraint operations

*   const_fake_quant : Emulates the logic of the historic TensorFlow
    fake_quant_with_min_max_args operation.
*   stats_ref : Declares that statistics should be gathered at this point with
    a unique key and made available to future passes of the solver.
*   stats : Declares inline statistics (per layer and per axis) for the point
    in the computation. stats_ref ops are generally converted to statistical
    operations (stats) once trial runs have been performed.
*   coupled_ref : Declares points in the computation to be coupled from a type
    inference perspective based on a unique key.

## Integration with simulated quantization at training time

TensorFlow has historically used the
[tf.quantization.fake_quant_\*](https://www.tensorflow.org/api_docs/python/tf/quantization/fake_quant_with_min_max_args)
family of operations to simulate the effect of quantization at training time.

As originally implemented, TensorFlow Lite was the primary user of such
operations at inference time. When quantized inference was enabled, if every
eligible tensor passed through an appropriate fake_quant node (the rules of
which tensors can have fake_quant applied are somewhat involved), then
TensorFlow Lite would use the attributes of the fake_quant operations to make
a judgment about how to convert to use kernels from its quantized operations
subset.

In MLIR-based quantization, fake_quant_\* operations are handled by converting
them to a sequence of *qcast* (quantize) followed by *dcast* (dequantize) with
an appropriate *UniformQuantizedType* as the target of the qcast operation.

This allows subsequent compiler passes to preserve the knowledge that
quantization was simulated in a certain way, while giving the compiler
flexibility to move the casts as it simplifies the computation and converts it
to a form based on integral arithmetic.

This scheme also naturally allows computations that are *partially quantized*,
where the parts which could not be reduced to integral operations are still
carried out in floating point with appropriate conversions at the boundaries.

## TFLite native quantization

TODO: Flesh this out

### General algorithm

1.  Take input min/max information and set the ArrayInfo (which really is
    InputOrOutputArrayInfo).
1.  In LegalizeTF, convert ArrayInfo min/max to tf.Quantize and tf.Dequantize
    nodes (or tf.FakeQuant). Convert all constant FakeQuants to (tf.FQ ->
    tfl.Q -> tfl.DQ).
1.  Hardcoded logic/propagation needs to happen here.
1.  Run TF constant folding.
1.  In PrepareTFL, convert all tf.FQ to (tfl.Q -> tfl.DQ).
1.  Run the quantization pass that takes (tfl.DQ (for both input and weights)
    -> op -> tfl.Q) and replaces it with (op). Also replace (constant_float ->
    tfl.Q) with (constant_quant). The rationale for this rewrite is sketched
    below.
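To see why the rewrite in the last step is sound, consider the following NumPy
sketch (illustrative only; the helpers are hypothetical and do not use the
actual TFLite kernels). It compares the (tfl.DQ -> op -> tfl.Q) reference
pattern with an integer-only equivalent for an elementwise add whose operands
and result share the same quantization parameters:

```python
import numpy as np

def quantize(x, scale, zp):
    rounded = np.round(x / scale).astype(np.int32) + zp
    return np.clip(rounded, 0, 255).astype(np.uint8)

def dequantize(q, scale, zp):
    return (q.astype(np.int32) - zp).astype(np.float32) * np.float32(scale)

# Pattern before the pass: tfl.DQ -> (float add) -> tfl.Q.
def add_via_float(qa, qb, scale, zp):
    return quantize(dequantize(qa, scale, zp) + dequantize(qb, scale, zp),
                    scale, zp)

# Pattern after the pass: integer-only arithmetic on the affine values.
def add_quantized(qa, qb, zp):
    # With identical input/output scales, the scaled values simply add;
    # only the doubled zero point needs correcting.
    acc = qa.astype(np.int32) + qb.astype(np.int32) - zp
    return np.clip(acc, 0, 255).astype(np.uint8)

a = np.array([100, 128, 200], dtype=np.uint8)
b = np.array([150, 128, 90], dtype=np.uint8)
assert np.array_equal(add_via_float(a, b, 0.1, 128), add_quantized(a, b, 128))
```

In the general case the operands and result have different scales, and the
fused kernel must requantize its accumulator using a fixed point multiplier,
as described in the paper referenced at the top of this document; the
equal-scale case shown here is the simplest instance.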