# MLIR Quantization

This document outlines the design of the MLIR quantization system. While the
term "quantization" is highly overloaded, in this case, it refers to a fairly
narrow scope of techniques in use to enable conversion of floating-point
computations to corresponding and plausible variants expressed in integer math
for inference, as has historically been supported by low-bit depth inference
engines such as TFLite, various accelerator hardware, and many DSPs.

Much of this is inspired by the approach taken
[in this paper](https://arxiv.org/abs/1712.05877) with many extensions and
adaptations folded in. It specifically documents the positions that MLIR has
taken on the topic, and is not a general reference.

[TOC]

## Uniform quantization

The primary quantization mechanism supported by MLIR is a scheme which can
express fixed point and affine transformations via uniformly spaced points on
the [Real](https://en.wikipedia.org/wiki/Real_number) number line.

Further, the scheme can be applied:

*   *per-layer* : Applying to every value within the target type.
*   *per-axis* (also called *per-channel*) : Applying individually to each index
    along a specific axis of a tensor type.

### Fixed point values

[Fixed point](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) values are a
[Real](https://en.wikipedia.org/wiki/Real_number) number divided by a *scale*.
We will call the result of this division the *scaled value*.

$$ real\_value = scaled\_value * scale $$

The scale can be interpreted as the distance, in real units, between neighboring
scaled values. For example, if the scale is $$ \pi $$, then fixed point values
with this scale can only represent multiples of $$ \pi $$, and nothing in
between. The maximum rounding error to convert an arbitrary Real to a fixed
point value with a given $$ scale $$ is $$ \frac{scale}{2} $$. Continuing the
previous example, when $$ scale = \pi $$, the maximum rounding error will be
$$ \frac{\pi}{2} $$.

Multiplication can be performed on scaled values with different scales, using
the same algorithm as multiplication of real values (note that the product's
scaled value has $$ scale_{product} = scale_{left \mbox{ } operand} *
scale_{right \mbox{ } operand} $$). Addition can be performed on scaled values,
so long as they have the same scale, using the same algorithm as addition of
real values. This makes it convenient to represent scaled values on a computer
as signed integers, and perform arithmetic on those signed integers, because
the results will be correct scaled values.

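To make the scale bookkeeping concrete, the following is a minimal Python
sketch of arithmetic on scaled values (the function names are illustrative,
not part of any MLIR API):

```python
def to_scaled(real_value: float, scale: float) -> int:
    """Quantize a real value to the nearest scaled value (an integer)."""
    return round(real_value / scale)  # Rounding error is at most scale / 2.

def multiply(lhs: int, lhs_scale: float, rhs: int, rhs_scale: float):
    """Scales may differ; the product's scale is the product of the scales."""
    return lhs * rhs, lhs_scale * rhs_scale

def add(lhs: int, rhs: int, scale: float):
    """Addition requires both operands to share the same scale."""
    return lhs + rhs, scale

# Example: represent 0.75 and 0.5 with scale 0.25 (as integers 3 and 2).
a = to_scaled(0.75, 0.25)                          # 3
b = to_scaled(0.5, 0.25)                           # 2
sum_val, sum_scale = add(a, b, 0.25)               # (5, 0.25)   -> 1.25
prod_val, prod_scale = multiply(a, 0.25, b, 0.25)  # (6, 0.0625) -> 0.375
```
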
### Affine values

Mathematically speaking, affine values are the result of
[adding a Real-valued *zero point*, to a scaled value](https://en.wikipedia.org/wiki/Affine_transformation#Representation).
Alternatively (and equivalently), subtracting a zero point from an affine value
results in a scaled value:

$$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$

Essentially, affine values are a shift of the scaled values by some constant
amount. Arithmetic (i.e., addition, subtraction, multiplication, division)
cannot, in general, be directly performed on affine values; they must first be
[converted](#affine-to-fixed-point) to the equivalent scaled values.

As alluded to above, the motivation for using affine values is to more
efficiently represent real values that will actually be encountered during
computation. Frequently, real values that will be encountered are not
symmetric around the real zero. We also make the assumption that the real zero
is encountered during computation, and should thus be represented.

In this case, it is inefficient to store scaled values represented by signed
integers, as some of the signed integers will never be used. In effect, the bit
patterns corresponding to those signed integers are going to waste.

In order to exactly represent the real zero with an integral-valued affine
value, the zero point must be an integer between the minimum and maximum affine
value (inclusive). For example, given an affine value represented by an 8 bit
unsigned integer, we have: $$ 0 \leq zero\_point \leq 255 $$. This is important,
because in convolution-like operations of deep neural networks, we frequently
need to zero-pad inputs and outputs, so zero must be exactly representable, or
the result will be biased.

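As an illustration, one common policy for deriving a scale and integral zero
point from an observed range of real values is sketched below (this follows
the approach of the referenced paper and is one valid choice, not a
requirement of MLIR):

```python
def choose_affine_params(real_min: float, real_max: float,
                         qmin: int = 0, qmax: int = 255):
    """Derive (scale, zero_point) mapping [real_min, real_max] onto uint8."""
    # Widen the range to include the real zero, so that zero is exactly
    # representable (needed for unbiased zero padding).
    real_min = min(real_min, 0.0)
    real_max = max(real_max, 0.0)
    scale = (real_max - real_min) / (qmax - qmin)
    # Nudge the zero point to an integer in [qmin, qmax].
    zero_point = round(qmin - real_min / scale)
    return scale, zero_point

# Example: reals in [-1.0, 3.0] map to uint8 with scale = 4/255 and
# zero_point = 64, so the affine value 64 represents the real zero exactly.
scale, zero_point = choose_affine_params(-1.0, 3.0)
```
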
### Relation

Real values, fixed point values, and affine values relate through the following
equation, which demonstrates how to convert one type of number to another:

$$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$

Note that computers generally store mathematical values using a finite number of
bits. Thus, while the above conversions are exact, to store the result in a
finite number of bits, we must, in general, round the result of the conversion
(this applies to both cases: storing using floating point and storing using
fixed point). Note that a full discussion of rounding behavior is outside the
scope of this document, and it is safe to assume unless otherwise stated that
rounding should be according to the IEEE754 default of RNE (where hardware
permits).

### Converting between real and fixed point or affine

To convert a real value to a fixed point value, we must know the scale. To
convert a real value to an affine value, we must know the scale and the zero
point.

#### Real to affine

To convert an input tensor of real-valued elements (usually represented by a
floating point format, frequently
[Single precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format))
to a tensor of affine elements represented by an integral type (e.g. 8-bit
unsigned integer), the following conversion can be performed (note that it is
not required that all representable values of the integral type are used):

$$
\begin{align*}
af&fine\_value_{uint8 \, or \, uint16} \\
      &= clampToTargetSize(roundToNearestInteger( \frac{real\_value_{Single}}{scale_{Single}})_{sint32} + zero\_point_{uint8 \, or \, uint16})
\end{align*}
$$

In the above, we assume that $$real\_value$$ is a Single, $$scale$$ is a Single,
$$roundToNearestInteger$$ returns a signed 32-bit integer, and $$zero\_point$$
is an unsigned 8-bit or 16-bit integer. These bit depths are indicative of
common types on typical hardware; the scheme is not constrained to particular
bit depths, nor is it required that the entire range of an N-bit integer be
used.

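A sketch of this conversion in Python with NumPy (illustrative; the helper
name is hypothetical):

```python
import numpy as np

def real_to_affine(real: np.ndarray, scale: float, zero_point: int,
                   qmin: int = 0, qmax: int = 255) -> np.ndarray:
    """Convert float32 reals to uint8 affine values."""
    # np.round rounds half to even, matching the RNE default noted above.
    rounded = np.round(real / np.float32(scale)).astype(np.int32)
    affine = rounded + np.int32(zero_point)               # Still in sint32.
    return np.clip(affine, qmin, qmax).astype(np.uint8)   # clampToTargetSize.
```
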
#### Affine to real

To convert an output tensor of affine elements represented by uint8
or uint16 to a tensor of real-valued elements (usually represented with a
floating point format, frequently Single precision), the following conversion
can be performed:

$$
\begin{align*}
re&al\_value_{Single} \\
      &= roundToNearestFloat((affine\_value_{uint8 \, or \, uint16} - zero\_point_{uint8 \, or \, uint16})_{sint32})_{Single} * scale_{Single}
\end{align*}
$$

In the above, we assume that the result of subtraction is in 32-bit signed
integer format, and that $$roundToNearestFloat$$ returns a Single.

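The corresponding inverse conversion, sketched in the same style:

```python
import numpy as np

def affine_to_real(affine: np.ndarray, scale: float,
                   zero_point: int) -> np.ndarray:
    """Convert uint8/uint16 affine values back to float32 reals."""
    # Subtract in sint32 so that values below the zero point do not wrap.
    scaled = affine.astype(np.int32) - np.int32(zero_point)
    # The cast to float32 is the roundToNearestFloat step.
    return scaled.astype(np.float32) * np.float32(scale)
```
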
#### Affine to fixed point

When the affine and fixed point scales are the same, subtract the zero point
from the affine value to get the equivalent fixed point value.

$$
scaled\_value = affine\_value_{non\mbox{-}negative} - zero\_point_{non\mbox{-}negative}
$$

#### Fixed point to affine

When the affine and fixed point scales are the same, add the zero point to the
fixed point value to get the equivalent affine value.

$$
affine\_value_{non\mbox{-}negative} = scaled\_value + zero\_point_{non\mbox{-}negative}
$$

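These shifts are what make arithmetic on affine values work: operating on
affine values directly would double-count the zero point. A small illustrative
sketch of addition done correctly:

```python
def add_affine(a: int, b: int, zero_point: int) -> int:
    """Add two affine values that share a scale and zero point."""
    # Convert each affine value to its fixed point (scaled) equivalent,
    # add the scaled values, then convert the sum back to an affine value.
    scaled_sum = (a - zero_point) + (b - zero_point)
    return scaled_sum + zero_point
```
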
## Usage within MLIR

There are several components to the quantization system being developed within
MLIR:

*   *Quantization* dialect containing:

    *   A family of [QuantizedTypes](#quantized-type) which represent the
        mapping between *expressed* values (typically of a floating point
        computer type) and *storage* values (typically of an integral computer
        type).
    *   [Type conversion ops](#quantized-type-conversion-ops) for converting
        between types based on a QuantizedType and its *expressed* and *storage*
        sub-types.
    *   [Instrumentation ops](#instrumentation-and-constraint-ops) for assigning
        instrumentation points within the computation where runtime statistics
        may help guide the quantization process.

*   [Integration with simulated quantization at training time](#integration-with-simulated-quantization-at-training-time)

*   [TFLite native quantization](#tflite-native-quantization)

    *   The TFLite op-set natively supports uniform-quantized variants.
    *   Passes and tools exist to convert directly from the *TensorFlow* dialect
        to the TFLite quantized operation set.

*   [*FxpMath* dialect](#fxpmath-dialect) containing (experimental) generalized
    representations of fixed-point math operations and conversions:

    *   [Real math ops](#real-math-ops) representing common combinations of
        arithmetic operations that closely match corresponding fixed-point math
        concepts (as opposed to being spread across multiple ops as is typical
        in source dialects).
    *   [Fixed-point math ops](#fixed-point-math-ops) for carrying out
        computations on integers, as are typically needed by uniform
        quantization schemes.
    *   Passes to lower from real math operations to fixed-point math
        operations.

*   [Solver tools](#solver-tools) which can (experimentally and generically)
    operate on computations expressed in the *FxpMath* dialect in order to
    convert from floating point types to appropriate *QuantizedTypes*, allowing
    the computation to be further lowered to integral math operations.

Not every application of quantization will use all of these facilities.
Specifically, the TensorFlow to TensorFlow Lite conversion uses the
QuantizedTypes but has its own operations for type conversion and expression of
the supporting math.

## Quantization Dialect

### Quantized type

TODO: Flesh this section out.

*   QuantizedType base class
*   UniformQuantizedType

### Quantized type conversion operations

*   qcast : Convert from an expressed type to QuantizedType
*   dcast : Convert from a QuantizedType to its expressed type
*   scast : Convert between a QuantizedType and its storage type

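The approximate numeric semantics of these conversions, sketched in Python for
a uniform quantized type with uint8 storage (illustrative only; the actual ops
operate on MLIR types and tensors):

```python
def qcast(expressed: float, scale: float, zero_point: int) -> int:
    """Expressed (e.g. f32) -> quantized: the real-to-affine conversion."""
    return max(0, min(255, round(expressed / scale) + zero_point))

def dcast(quantized: int, scale: float, zero_point: int) -> float:
    """Quantized -> expressed: the affine-to-real conversion."""
    return (quantized - zero_point) * scale

def scast(value: int) -> int:
    """Quantized <-> storage: a bit-for-bit reinterpretation. The numeric
    value is unchanged; only the static type differs."""
    return value
```
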
### Instrumentation and constraint operations

*   const_fake_quant : Emulates the logic of the historic TensorFlow
    fake_quant_with_min_max_args operation.
*   stats_ref : Declares that statistics should be gathered at this point with a
    unique key and made available to future passes of the solver.
*   stats : Declares inline statistics (per layer and per axis) for the point in
    the computation. stats_ref ops are generally converted to stats ops once
    trial runs have been performed.
*   coupled_ref : Declares points in the computation to be coupled from a type
    inference perspective based on a unique key.

## Integration with simulated quantization at training time

TensorFlow has historically used the
[tf.quantization.fake_quant_\*](https://www.tensorflow.org/api_docs/python/tf/quantization/fake_quant_with_min_max_args)
family of operations to simulate the effect of quantization at training time.

As originally implemented, TensorFlow Lite was the primary user of such
operations at inference time. When quantized inference was enabled, if every
eligible tensor passed through an appropriate fake_quant node (the rules of
which tensors can have fake_quant applied are somewhat involved), then
TensorFlow Lite would use the attributes of the fake_quant operations to make a
judgment about how to convert to use kernels from its quantized operations
subset.

In MLIR-based quantization, fake_quant_\* operations are handled by converting
them to a sequence of *qcast* (quantize) followed by *dcast* (dequantize) with
an appropriate *UniformQuantizedType* as the target of the qcast operation.

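Numerically, that qcast/dcast pair behaves like the following self-contained
sketch (the function name and the uint8 range are illustrative):

```python
def fake_quant(real: float, scale: float, zero_point: int,
               qmin: int = 0, qmax: int = 255) -> float:
    """Snap a real value onto the quantized grid and return it as a float."""
    # qcast: real -> affine, with rounding and clamping.
    quantized = max(qmin, min(qmax, round(real / scale) + zero_point))
    # dcast: affine -> real.
    return (quantized - zero_point) * scale
```

This is why training through fake_quant exposes the model to the same rounding
and clamping error that quantized inference will later introduce.
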
This allows subsequent compiler passes to preserve the knowledge that
quantization was simulated in a certain way, while giving the compiler
flexibility to move the casts as it simplifies the computation and converts it
to a form based on integral arithmetic.

This scheme also naturally allows computations that are *partially quantized*,
where the parts which could not be reduced to integral operations are still
carried out in floating point with appropriate conversions at the boundaries.

## TFLite native quantization

TODO: Flesh this out.

### General algorithm

1.  Take input min/max information and set the ArrayInfo (which really is
    InputOrOutputArrayInfo).
1.  In LegalizeTF, convert ArrayInfo min/max to tf.Quantize and tf.Dequantize
    nodes (or tf.FakeQuant). Convert all constant FakeQuants to (tf.FQ -> tfl.Q
    -> tfl.DQ).
1.  Hardcoded logic/propagation needs to happen here.
1.  Run TF constant folding.
1.  In PrepareTFL, convert all tf.FQ to (tfl.Q -> tfl.DQ).
1.  Run the quantization pass that takes (tfl.DQ (for both input and weights) ->
    op -> tfl.Q) and replaces it with (op). Also replace (constant_float ->
    tfl.Q) with (constant_quant).

## FxpMath dialect

### Real math operations

Note that these all support explicit clamps, which allows for simple fusions
and representation of some common sequences of quantization-compatible math. In
addition, some support explicit biases, which are often represented as separate
adds in source dialects.

TODO: This operation set is still evolving and needs to be completed.

*   RealBinaryOp
    *   RealAddEwOp
    *   RealSubEwOp
    *   RealMulEwOp
    *   RealDivEwOp
*   RealUnaryOp
    *   IDENTITY
    *   TANH
    *   SIGMOID
    *   EXP
    *   LOG
    *   NEG
    *   RSQRT
    *   SIN
    *   SQUARE
    *   SQRT
    *   CMPZ
    *   CMPNZ
    *   CMPLZ
    *   CMPGZ

### Fixed-point math operations

TODO: This operation set only has enough operations to lower a simple
power-of-two RealAddEwOp.

*   RoundingDivideByPotFxpOp
*   SaturatingAddFxpOp

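As a rough illustration, plausible semantics for these two primitives are
sketched below (the exact rounding and saturation rules here are assumptions
chosen to match common fixed-point practice, not a specification of the ops):

```python
def rounding_divide_by_pot(x: int, exponent: int) -> int:
    """Divide by 2**exponent, rounding to nearest with ties away from zero."""
    mask = (1 << exponent) - 1
    # Python ints act as infinite two's complement, so & works for negatives.
    remainder = x & mask
    threshold = (mask >> 1) + (1 if x < 0 else 0)
    return (x >> exponent) + (1 if remainder > threshold else 0)

def saturating_add_int16(a: int, b: int) -> int:
    """Add two int16 values, clamping to the int16 range instead of wrapping."""
    return max(-32768, min(32767, a + b))

assert rounding_divide_by_pot(5, 1) == 3    # 2.5 rounds away from zero.
assert rounding_divide_by_pot(-5, 1) == -3
assert saturating_add_int16(30000, 10000) == 32767
```
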
## Solver tools

Solver tools exist to analyze an MLIR computation, expressed in either a
supported source dialect or in the *real math ops* set, and solve for
appropriate QuantizedTypes that allow the computation to be lowered to integral
math.

These tools are an active area of work and may be expanded in the future to
adjacent areas such as solving for transformations to other kinds of lower
precision types (e.g. bfloat16 or fp16).

Solver tools are expected to operate in several modes, depending on the
computation and the training characteristics of the model:

*   *Transform* : With all available information in the MLIR computation, infer
    boundaries where the computation can be carried out with integral math and
    change types accordingly to appropriate QuantizedTypes:

    *   For passthrough ops which do not perform active math, change them to
        operate directly on the storage type, converting in and out at the edges
        via scast operations.
    *   For operations that have the *Quantizable* trait, the type can be set
        directly. This includes operations from the
        [real math ops set](#real-math-ops).
    *   For others, encase them in appropriate dcast/qcast operations, presuming
        that some follow-on pass will know what to do with them.

*   *Instrument* : Most of the time, there are not sufficient implied
    constraints within a computation to perform many transformations. For this
    reason, the solver can insert instrumentation operations at points where
    additional runtime statistics may yield solutions. It is expected that such
    computations will be lowered as-is for execution, run over an appropriate
    evaluation set, and statistics at each instrumentation point made available
    for a future invocation of the solver.

*   *Simplify* : A variety of passes and simplifications are applied once
    QuantizedTypes are added in order to arrive at a computation that is
    expressed in as much integral math as possible, with the fewest casts.
360