Lines Matching +full:zero +full:- +full:point
5 narrow scope of techniques in use to enable conversion of floating-point
7 for inference, as has historically been supported by low-bit depth inference
20 express fixed point and affine transformations via uniformly spaced points on the
25 * *per-layer* : Applying to every value within the target type.
26 * *per-axis* (also called *per-channel*) : Applying individually to each index
29 ### Fixed point values
31 [Fixed point](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) values are a
38 scaled values. For example, if the scale is $ \pi $, then fixed point values
41 point value with a given $ scale $ is $ \frac{scale}{2} $. Continuing the
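As a concrete illustration of the rounding-error bound above, here is a minimal Python sketch. The helper names `to_fixed_point` and `from_fixed_point` are ours, not from any library:

```python
import math

def to_fixed_point(real_value: float, scale: float) -> int:
    """Quantize a real value to the nearest integer multiple of `scale`."""
    return round(real_value / scale)

def from_fixed_point(scaled_value: int, scale: float) -> float:
    """Recover the approximate real value from a fixed point value."""
    return scaled_value * scale

scale = math.pi
q = to_fixed_point(10.0, scale)                # 10.0 is closest to 3 * pi
error = abs(from_fixed_point(q, scale) - 10.0)
assert error <= scale / 2                      # max rounding error is scale/2
```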
57 [adding a real-valued *zero point* to a scaled value](https://en.wikipedia.org/wiki/Affine_transfo…
58 Alternatively (and equivalently), subtracting a zero point from an affine value results in a
61 $$ real\\_value = scaled\\_value * scale = (affine\\_value - zero\\_point) * scale $$
66 [converted](#affine-to-fixed-point) to the equivalent scaled values.
71 symmetric around the real zero. We also make the assumption that the real zero
78 In order to exactly represent the real zero with an integral-valued affine
79 value, the zero point must be an integer between the minimum and maximum affine
81 unsigned integer, we have: $ 0 \leq zero\\_point \leq 255 $. This is important,
82 because in convolution-like operations of deep neural networks, we frequently
83 need to zero-pad inputs and outputs, so zero must be exactly representable, or
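The zero-point constraint can be illustrated with a small sketch. This is not from the document; `choose_scale_and_zero_point` is a hypothetical helper assuming an asymmetric uint8 scheme, where the real range is first extended to include 0.0 (degenerate all-zero ranges are not handled):

```python
def choose_scale_and_zero_point(rmin: float, rmax: float,
                                qmin: int = 0, qmax: int = 255):
    """Pick a scale and an integer zero point for a uint8 affine scheme.

    Extending the range to include real 0.0 guarantees that zero (needed
    for zero-padding) maps exactly to an integer affine value.
    """
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = round(qmin - rmin / scale)
    return scale, max(qmin, min(qmax, zero_point))

scale, zp = choose_scale_and_zero_point(-1.0, 3.0)
assert 0 <= zp <= 255                  # zero point lies in the affine range
assert round(0.0 / scale) + zp == zp   # real zero is exactly representable
```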
88 Real values, fixed point values, and affine values relate through the following
91 $$ real\\_value = scaled\\_value * scale = (affine\\_value - zero\\_point) * scale $$
96 (this applies to both cases: storing using floating point and storing using
97 fixed point). Note that a full discussion of rounding behavior is outside the
102 ### Converting between real and fixed point or affine
104 To convert a real value to a fixed point value, we must know the scale. To
105 convert a real value to an affine value, we must know the scale and the zero point.
109 To convert an input tensor of real-valued elements (usually represented by a
110 floating point format, frequently
111 [Single precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format))
112 to a tensor of affine elements represented by an integral type (e.g. 8-bit
119 &= clampToTargetSize(roundToNearestInteger(\frac{real\\\_value}{scale}) + zero\\\_point) \\\\
125 - `real_value`: Single
126 - `scale`: Single
127 - `roundToNearestInteger`: returns a 32-bit integer
128 - `zero_point`: 8-bit or 16-bit integer
129 - `affine_value`: 8-bit or 16-bit integer
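The real-to-affine conversion can be sketched in Python, assuming a uint8 target. Note that Python's `round` uses round-half-to-even, which may differ from a given implementation's `roundToNearestInteger`:

```python
def quantize(real_value: float, scale: float, zero_point: int,
             qmin: int = 0, qmax: int = 255) -> int:
    """real -> affine, following the formula above (uint8 target assumed)."""
    rounded = round(real_value / scale)                # roundToNearestInteger
    return max(qmin, min(qmax, rounded + zero_point))  # clampToTargetSize

print(quantize(0.0, scale=0.5, zero_point=128))     # 128 (real zero)
print(quantize(1.0, scale=0.5, zero_point=128))     # 130
print(quantize(1000.0, scale=0.5, zero_point=128))  # 255 (clamped)
```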
131 Note that bit depth and number of fixed point values are indicative
134 N-bit integer is used.
139 or uint16 to a tensor of real-valued elements (usually represented with a
140 floating point format, frequently Single precision), the following conversion
146 &= roundToNearestFloat(affine\\\_value - zero\\\_point) * scale
152 - `real_value`: Single
153 - `scale`: Single
154 - `affine_value`: 8-bit or 16-bit integer
155 - `zero_point`: 8-bit or 16-bit integer
156 - `roundToNearestFloat`: returns a Single
157 - `-` (subtraction): returns a 32-bit signed integer
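The reverse conversion needs no clamping; a minimal sketch under the same uint8 assumption:

```python
def dequantize(affine_value: int, scale: float, zero_point: int) -> float:
    """affine -> real, following the formula above."""
    # Subtract in a wider signed integer type, then scale in floating point.
    return float(affine_value - zero_point) * scale

print(dequantize(130, scale=0.5, zero_point=128))  # 1.0
print(dequantize(128, scale=0.5, zero_point=128))  # 0.0
```

Quantizing and then dequantizing reproduces a real value only up to the scale/2 rounding error discussed earlier.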
159 #### Affine to fixed point
161 When the affine and fixed point scales are the same, subtract the zero point
162 from the affine value to get the equivalent fixed point value.
166 scaled\\\_value = affine\\\_value_{non\mbox{-}negative} - zero\\\_point_{non\mbox{-}negative}
170 #### Fixed point to affine
172 When the affine and fixed point scales are the same, add the zero point to the
173 fixed point value to get the equivalent affine value.
177 affine\\\_value_{non\mbox{-}negative} = scaled\\\_value + zero\\\_point_{non\mbox{-}negative}
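Both directions can be sketched together; these are trivial integer adjustments, valid only when the affine and fixed point scales match (the helper names are ours):

```python
def affine_to_fixed_point(affine_value: int, zero_point: int) -> int:
    """Subtract the zero point to get the equivalent fixed point value."""
    return affine_value - zero_point

def fixed_point_to_affine(scaled_value: int, zero_point: int) -> int:
    """Add the zero point back to recover the affine value."""
    return scaled_value + zero_point

zp = 128
# The two conversions are exact inverses of each other.
assert fixed_point_to_affine(affine_to_fixed_point(200, zp), zp) == 200
```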
188 * A family of [QuantizedTypes](#quantized-type) which represent the
189 mapping between *expressed* values (typically of a floating point
192 * [Type conversion ops](#quantized-type-conversion-operations) for converting
194 sub-types.
195 * [Instrumentation ops](#instrumentation-and-constraint-operations) for assigning
199 …th simulated quantization at training time](#integration-with-simulated-quantization-at-training-t…
201 * [TFLite native quantization](#tflite-native-quantization)
203 * The TFLite op-set natively supports uniform-quantized variants.
230 * stats_ref : Declares that statistics should be gathered at this point with a
232 * stats : Declares inline statistics (per layer and per axis) for the point in
251 In MLIR-based quantization, fake_quant_\* operations are handled by converting them to
262 in floating point with appropriate conversions at the boundaries.
273 nodes (or tf.FakeQuant). Convert all constant FakeQuants to (tf.FQ -> tfl.Q
274 -> tfl.DQ).
277 1. In PrepareTFL, convert all tf.FQ to (tfl.Q -> tfl.DQ).
278 1. Run a quantization pass that takes (tfl.DQ (for both input and weights) -> op
279 -> tfl.Q) and replaces it with (op). Also replace (constant_float -> tfl.Q)