Precision Preservation
----------------------

Although most software-based DNN implementations are FP32-based, many studies have shown that lower precision is sufficient for inference. NVDLA therefore chooses to support INT8/INT16/FP16 as a trade-off between precision and performance/area, and adopts several techniques to keep the precision loss under control. The diagram below gives an overview of NVDLA's precision-preservation architecture.

.. _fig_image53_nvdla_precision:

.. figure:: ias_image53_nvdla_precision.svg
  :align: center

  NVDLA precision-preservation architecture

In total, there are four types of approaches to precision control in the NVDLA pipeline:

- Convertor: The formula for a convertor in INT8 and INT16 is
  :math:`y = saturation\_round{(x - offset_{int}) * scaling_{int} >> shifter_{uint}}`
  (see the sketch after this list).

  `offset`, `scaling`, and `shifter` are programmable registers that allow software to control the output dynamic range. Saturation depends on the number of output bits.

  For INT8 and INT16, `offset` and `scaling` are treated as signed integers, and the exact number of bits depends on the input operands. `shifter` is a 5-bit unsigned integer (always specifying a right shift); the rounding method applied after the shift is "round half away from zero".

  For FP16, the dynamic range representable by FP16 is large, so no convertor or shifter logic is implemented in hardware.

  The convertor is able to keep the best possible precision even if the input data is not symmetric around 0, or its dynamic range is not 2\ :sup:`N`; NVDLA uses it to convert from internal precision (high) to external precision (low, typically INT8/INT16/FP16).

- Truncate: Truncation is enabled for INT8/INT16 only. The formula for truncation is:
  :math:`y = saturate\_round (x[msb : lsb])`

  `lsb` is a programmable register; `msb` is defined as `lsb` + `output_bits`. As with the convertor, the rounding method used in truncation is "round half away from zero".

  Truncation is used in the NVDLA internal pipeline where :math:`y` has enough bits (compared to the convertor case); it is the result of a trade-off between precision and area.

- Shifter: The shifter exists to make sure that the bias can be added to convolution results; it has the formula :math:`y = saturate ( x << shifter )`.

  `shifter` is a programmable register. The saturate function depends on the number of output bits.

- LUT: Look-up tables are used to deal with non-linear functions in networks; these functions include sigmoid/tanh activation, and local response normalization as mentioned in `Data Formats`_. We use a 2-level hybrid LUT to mimic those non-linear functions; for more detail, see :doc:`lut-programming`.
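To make the convertor behavior concrete, here is a minimal C sketch of the INT8/INT16 convertor datapath, i.e. :math:`y = saturation\_round((x - offset) * scaling >> shifter)` with round-half-away-from-zero applied after the shift. The function and helper names (`int_convert`, `round_half_away_shift`, `saturate`) are illustrative only, not part of the NVDLA software stack:

.. code-block:: c

  #include <stdint.h>

  /* Right shift with "round half away from zero": round the magnitude,
   * then restore the sign. */
  static int64_t round_half_away_shift(int64_t v, unsigned shifter)
  {
      if (shifter == 0)
          return v;
      int64_t half = (int64_t)1 << (shifter - 1);
      return (v >= 0) ? ((v + half) >> shifter) : -((-v + half) >> shifter);
  }

  /* Saturate to a signed integer of out_bits bits (e.g. 8 or 16). */
  static int64_t saturate(int64_t v, int out_bits)
  {
      int64_t max = ((int64_t)1 << (out_bits - 1)) - 1;   /* e.g.  127 for INT8 */
      int64_t min = -((int64_t)1 << (out_bits - 1));      /* e.g. -128 for INT8 */
      return (v > max) ? max : (v < min) ? min : v;
  }

  /* offset/scaling are signed, shifter is a 5-bit unsigned right shift. */
  int32_t int_convert(int64_t x, int32_t offset, int16_t scaling,
                      uint8_t shifter, int out_bits)
  {
      int64_t scaled = (x - offset) * (int64_t)scaling;
      return (int32_t)saturate(round_half_away_shift(scaled, shifter), out_bits);
  }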
FP16 error threshold
~~~~~~~~~~~~~~~~~~~~

The output of computations on floating-point data depends heavily on computation order. As a result, when comparing the results of tests computed in FP16 space, we allow a certain error threshold when comparing NVDLA's output against the reference. Below is a summary of the thresholds applied for each module.

+-------------+--------------------------------------------------------------------------------+
| Sub-module  | Threshold                                                                      |
+=============+================================================================================+
| CC DC       | :math:`fabs(a-b) \leq 2^{max\_exp-20} * R * S * C * 2 * scale + 2^{exp(a)-10}` |
+-------------+--------------------------------------------------------------------------------+
| CC Winograd | :math:`fabs(a-b) \leq 2^{max\_exp-20} * 18 * C * 2 * scale + 2^{exp(a)-10}`    |
+-------------+--------------------------------------------------------------------------------+
| SDP         | Bit-by-bit identical                                                           |
+-------------+--------------------------------------------------------------------------------+
| CDP         | :math:`(fabs(a-b) \leq 0.0001) \&\& (\frac{fabs(a-b)}{max\_value} \leq 0.001)` |
+-------------+--------------------------------------------------------------------------------+
| PDP         | :math:`(fabs(a-b) \leq 0.0001) \&\& (\frac{fabs(a-b)}{max\_value} \leq 0.001)` |
+-------------+--------------------------------------------------------------------------------+

In the above table, note that:

- :math:`exp(a)` in DC/Winograd is the operation that extracts the exponent field of an input: :math:`exp(a) = ((a>>10) \& 0x1f) - 15`.

- :math:`max\_exp` in DC/Winograd is the maximum exponent value of :math:`wt*data` inside a convolution kernel. As data moves in a sliding window, max_exp differs element by element:

  .. math:: max\_exp=\max_{\substack{s=0...S-1\\r=0...R-1\\c=0...C-1}}(exp(data_{r,s,c})\&(\sim 0x3)+exp(wt_{r,s,c})\&(\sim 0x3))

- **max_value** in CDP is the maximum value inside one square-sum window.

- **max_value** in PDP is the maximum value inside one pooling window.

Convertor programming
~~~~~~~~~~~~~~~~~~~~~

Programming interface
^^^^^^^^^^^^^^^^^^^^^

As mentioned above, a convertor has 3 parameters: `offset`, `scaling` and `shifter`. Depending on the use case, those parameters have different encodings:

+-----------+----------+-----------+---------------+---------------+
| Parameter | INT->INT | INT->FP16 | FP16->INT     | FP16->FP16    |
+===========+==========+===========+===============+===============+
| Offset    | INT32    | INT32     | Not supported | Not supported |
+-----------+----------+-----------+---------------+---------------+
| Scaling   | INT16    | INT16     | Not supported | Not supported |
+-----------+----------+-----------+---------------+---------------+
| Shifter   | UINT5    | UINT5     | Not supported | Not supported |
+-----------+----------+-----------+---------------+---------------+
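A hedged sketch of how software might carry these parameters around, following the encodings in the table above; the struct and function names are illustrative and do not come from the NVDLA register manual:

.. code-block:: c

  #include <stdint.h>

  /* Hypothetical container for the three programmable convertor parameters. */
  struct convertor_params {
      int32_t offset;   /* INT->INT / INT->FP16: signed INT32            */
      int16_t scaling;  /* INT->INT / INT->FP16: signed INT16            */
      uint8_t shifter;  /* UINT5: right-shift amount, must fit in 5 bits */
  };

  /* Sketch of a sanity check software might run before writing registers:
   * FP16-source conversions take no convertor parameters at all, and the
   * shifter must fit in its 5-bit field. */
  static int convertor_params_valid(const struct convertor_params *p,
                                    int input_is_fp16)
  {
      if (input_is_fp16)
          return 0;              /* FP16->INT / FP16->FP16: not supported */
      return p->shifter <= 31;   /* UINT5 range                           */
  }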
.. _convolution-convertors:

Convolution convertors
^^^^^^^^^^^^^^^^^^^^^^

The following is a list of places where convertors or truncation might be used in a convolution layer (refer to :numref:`fig_image53_nvdla_precision` for where the convertors sit in the system):

+--------------+----------------------------------------+--------------------------------------+-------+
| Convertor    | Functionality                          | Limitation                           | Owner |
+==============+========================================+======================================+=======+
| cc_cvt       | For image input, this convertor is     | INT8/INT16 only;                     | HW    |
|              | responsible for mean subtraction and   | for FP16, subtract mean data only    |       |
|              | 8-bit conversion; for feature input,   |                                      |       |
|              | this convertor is not used             |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+
| wt_cvt       | Convert weight data to                 | Offset is not allowed                | SW    |
|              | INT8/16/FP16 representable             |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+
| pra_trunc    | Truncate the Winograd pre-transformed  | Used for Winograd mode and           | HW    |
|              | results to INT8/16/FP16 representable  | CSC.PROC_PRECISION=INT8/INT16 only   |       |
+--------------+----------------------------------------+--------------------------------------+-------+
| cc_out_trunc | Truncate the data to INT32/FP32 before | CACC.PROC_PRECISION=INT8/INT16 only  | HW    |
|              | sending to SDP                         |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+
| bs_cvt       | Convert bias data to                   | N/A                                  | SW    |
|              | INT8/16/FP16 representable             |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+
| bs_shifter   | Shift the input bias to make it        | SDP.PROC_PRECISION=INT8/INT16 only   | HW    |
|              | addable with convolution pipeline      |                                      |       |
|              | results                                |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+

To make the bias addable with the convolution pipeline results, the scaling factors must satisfy:

.. math::

  \begin{equation}\begin{cases}
  SF_{bs}*2^{bs\_trunc}=\frac{SF_{in}*SF_{wt}}{2^{pra\_trunc+cc\_out\_trunc}},&\text{conv=winograd}\\
  SF_{bs}*2^{bs\_trunc}=\frac{SF_{in}*SF_{wt}}{2^{cc\_out\_trunc}},&\text{conv=DC}
  \end{cases}\end{equation}

If the input data is encoded with an offset (:math:`x'=(x-offset)*SF`), this offset must be carefully considered in the cases below:

- Padding: Convolution supports zero-padding; however, if the input is encoded with an offset, then x=0 becomes :math:`x'=(0-offset)*SF=-offset*SF`, so hardware must do "value padding" instead of "zero padding". Convolution has a register named PADDING_VALUE; it should be set to :math:`-offset*SF` for the INT8/16 pipeline (see the sketch after this list). For FP16, we assume there is no offset, so PADDING_VALUE should be set to 0.

- Activation: As discussed above, an activation such as ReLU is a piecewise function:

  .. math::

    \begin{equation}\begin{cases}
    y=x,&\text{x>0}\\
    y=0,&\text{otherwise}
    \end{cases}\end{equation}

  0 plays an important role in deciding the activation output. Unfortunately, if an "offset" is applied to the convolution input data, the "0" is no longer 0 in the encoded activation data.
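A minimal sketch of the padding-value choice described in the Padding item above; the function name is illustrative, and doubles are used only to show the arithmetic (the real register takes an integer encoding):

.. code-block:: c

  #include <math.h>
  #include <stdint.h>

  /* Sketch: value to program into the convolution PADDING_VALUE register
   * when the input is encoded as x' = (x - offset) * SF; an all-zero pad
   * pixel then corresponds to -offset * SF. */
  static int32_t conv_padding_value(double offset, double sf, int is_fp16_pipe)
  {
      if (is_fp16_pipe)
          return 0;                         /* FP16: no offset assumed       */
      return (int32_t)lround(-offset * sf); /* INT8/INT16: "value padding"   */
  }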
Let us deduce the CC output (activation-layer input) offset based on the convolution definition.

Given :math:`In_{int}=SF_{in}*(In-offset_{in})`, :math:`Wt_{int}=SF_{wt}*Wt`, and :math:`CC_{FP}=\sum{In*Wt}`:

.. math::

  \begin{align*}
  CC_{int} & = \frac{\sum{In_{int}*Wt_{int}}}{2^{pra\_trunc+cc\_out\_trunc}} \\
  & = \frac{\sum{SF_{in}*(In-offset_{in})*SF_{wt}*Wt}}{2^{pra\_trunc+cc\_out\_trunc}} \\
  & = \frac{SF_{in}*SF_{wt}*\sum{(In-offset_{in})*Wt}}{2^{pra\_trunc+cc\_out\_trunc}} \\
  & = \frac{SF_{in}*SF_{wt}*(\sum{In*Wt}-offset_{in}*\sum{Wt})}{2^{pra\_trunc+cc\_out\_trunc}} \\
  & = \frac{SF_{in}*SF_{wt}*(CC_{FP}-offset_{in}*\sum{Wt})}{2^{pra\_trunc+cc\_out\_trunc}}
  \end{align*}

(The truncation of activation/weight is merged into :math:`SF_{in}` and :math:`SF_{wt}` in the formula above to simplify the deduction.)

So the CC output offset is :math:`\frac{SF_{in}*SF_{wt}*offset_{in}*\sum{Wt}}{2^{pra\_trunc+cc\_out\_trunc}}`.

Note that the formula above assumes no quantization error. In practice there is quantization error on the weights, so the actual offset is :math:`\frac{SF_{in}*SF_{wt}*offset_{in}*\sum{Wt^{'}}}{2^{pra\_trunc+cc\_out\_trunc}}`, where :math:`Wt^{'}` is the low-precision version of the weights, which takes the weight quantization error into account.

:math:`\sum{Wt^{'}}` differs channel by channel, which means the offset above also varies channel by channel; a per-channel operation therefore has to be adopted to compensate the CC output offset. This compensation is done by the ALU module in X1/X2/Y of SDP (a sketch of the per-channel compensation follows).
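A hedged sketch of that per-channel compensation, assuming software has pre-computed the per-channel sums of the quantized weights; names and the use of doubles are illustrative only, this is not NVDLA driver code:

.. code-block:: c

  #include <math.h>

  /* Sketch: per-channel compensation of the CC output offset derived above.
   * wt_sum[k] holds the per-channel sum of the quantized weights; comp[k] is
   * the value the SDP ALU (X1/X2/Y) would add back for output channel k. */
  static void cc_offset_compensation(double sf_in, double sf_wt, double offset_in,
                                     const double *wt_sum, int num_channels,
                                     unsigned pra_trunc, unsigned cc_out_trunc,
                                     double *comp)
  {
      double denom = ldexp(1.0, (int)(pra_trunc + cc_out_trunc)); /* 2^(pra+cc) */
      for (int k = 0; k < num_channels; k++)
          comp[k] = sf_in * sf_wt * offset_in * wt_sum[k] / denom;
  }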
SDP convertors
^^^^^^^^^^^^^^

SDP has several use scenarios; the table below lists how those scenarios map to SDP sub-modules (for the meaning of X/Y, please refer to :numref:`fig_image53_nvdla_precision`):

+---------------------------------+------------+
| Use scenario                    | Sub-module |
+=================================+============+
| Bias addition                   | X or Y     |
+---------------------------------+------------+
| Batch Normalization             | X or Y     |
+---------------------------------+------------+
| Element-wise                    | X or Y     |
+---------------------------------+------------+
| Activation (ReLU/PReLU)         | X or Y     |
+---------------------------------+------------+
| Activation (Sigmoid/TanH, etc.) | Y          |
+---------------------------------+------------+
| Precision conversion            | X or Y     |
+---------------------------------+------------+

Let's review those cases one by one.

.. _bias-addition-1:

Bias addition
'''''''''''''

This is already covered by `Convolution convertors`_.

.. _batch-normalization-1:

Batch normalization
'''''''''''''''''''

Here is a list of the convertors/shifters needed to realize the batch normalization function in SDP:

+--------------+----------------------------------------+--------------------------------------+-------+
| Convertor    | Functionality                          | Limitation                           | Owner |
+==============+========================================+======================================+=======+
| bn_m_cvt     | Convert the offline-trained batch      | N/A                                  | SW    |
|              | normalization mean data to             |                                      |       |
|              | INT8/16/FP16 representable             |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+
| bn_m_shifter | Shift the bn_m_cvt converted values to | SDP.PROC_PRECISION=INT8/INT16 only   | HW    |
|              | have the same scaling factor as the    |                                      |       |
|              | input                                  |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+
| bn_v_cvt     | Convert the offline-trained batch      | Offset is not allowed                | SW    |
|              | normalization 1/variance to            |                                      |       |
|              | INT8/16/FP16 representable             |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+

The input of batch normalization comes either from CONV/MC or from previous pipeline stages, so we should assume :math:`O_{in}, SF_{in}` are applied on the input. In order to make the mean addable with the input data, the formula below should be satisfied:

.. math:: SF_{in} = SF_{bn\_m\_cvt} * 2^{bn\_m\_shifter}

Element-wise
''''''''''''

Here is a list of the convertors/shifters related to the element-wise operation in SDP:

+--------------+----------------------------------------+--------------------------------------+-------+
| Convertor    | Functionality                          | Limitation                           | Owner |
+==============+========================================+======================================+=======+
| ew_cvt       | The convertor applied to the           | For SDP.PROC_PRECISION=INT8/INT16    | HW    |
|              | element-wise input; since element-wise | only                                 |       |
|              | operands are cube-based, this input is |                                      |       |
|              | the output of upstream hardware layers |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+
| ew_inv_cvt   | Align the offset/scaling factors to    | For SDP.PROC_PRECISION=INT8/INT16    | HW    |
|              | meet the requirement of the different  | only                                 |       |
|              | element-wise operations (see below).   |                                      |       |
|              | If the requirement is already          |                                      |       |
|              | satisfied, this convertor can be       |                                      |       |
|              | bypassed.                              |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+

Since there may be 2 convertors applied to the E-RDMA stream, if the original input is x, the output from ew_inv_cvt is:

.. math:: x'=\{(x-O_{ew\_cvt})*SF_{ew\_cvt}-O_{ew\_inv\_cvt}\}*SF_{ew\_inv\_cvt}=\{x-(O_{ew\_cvt}+\frac{O_{ew\_inv\_cvt}}{SF_{ew\_cvt}})\}*SF_{ew\_cvt}*SF_{ew\_inv\_cvt}

In order to make the element-wise operation act as intended, the convertor parameters should be carefully configured based on the element-wise operation (assume the convertor parameters from the BN module are :math:`O_{in}, SF_{in}`; a sketch follows this list):

- MAX

  The offset/scaling applied on the input stream and the E-RDMA stream should be the same, which means:

  .. math:: O_{in}==O_{ew\_cvt}+\frac{O_{ew\_inv\_cvt}}{SF_{ew\_cvt}}

  .. math:: SF_{in}==SF_{ew\_cvt}*SF_{ew\_inv\_cvt}

- SUM

  The scaling factor applied on both streams should be the same:

  .. math:: SF_{in} == SF_{ew\_cvt} * SF_{ew\_inv\_cvt}

- PROD

  The offset applied on the E-RDMA stream should be 0:

  .. math:: O_{ew\_cvt} + \frac{O_{ew\_inv\_cvt}}{SF_{ew\_cvt}} == 0
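As an illustration of those constraints, here is one possible way software could derive the ew_inv_cvt parameters from the already-fixed BN-side and ew_cvt parameters. This is a hedged sketch, not driver code: the names mirror the symbols in the formulas above, and doubles are used only to show the algebra (real registers use the integer encodings described earlier):

.. code-block:: c

  /* Sketch: derive ew_inv_cvt (offset, scaling) so that the element-wise
   * constraints above hold, given the BN-side parameters (o_in, sf_in) and
   * the fixed ew_cvt parameters (o_ew, sf_ew). */
  enum ew_op { EW_MAX, EW_SUM, EW_PROD };

  struct cvt_f { double offset, scaling; };

  static struct cvt_f derive_ew_inv_cvt(enum ew_op op,
                                        double o_in, double sf_in,
                                        double o_ew, double sf_ew)
  {
      struct cvt_f inv = { 0.0, 1.0 };
      switch (op) {
      case EW_MAX:  /* offsets and scalings of both streams must match      */
          inv.offset  = (o_in - o_ew) * sf_ew;  /* O_in = O_ew + O_inv/SF_ew */
          inv.scaling = sf_in / sf_ew;          /* SF_in = SF_ew * SF_inv    */
          break;
      case EW_SUM:  /* only the scaling factors must match; offset is free  */
          inv.offset  = 0.0;
          inv.scaling = sf_in / sf_ew;
          break;
      case EW_PROD: /* total offset on the E-RDMA stream must be zero       */
          inv.offset  = -o_ew * sf_ew;          /* O_ew + O_inv/SF_ew == 0   */
          inv.scaling = 1.0;                    /* scaling unconstrained     */
          break;
      }
      return inv;
  }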
Activation (ReLU/PReLU)
'''''''''''''''''''''''

The input offset for ReLU/PReLU has already been eliminated by the ALU unit of X1/X2/Y, so the 0s seen by ReLU/PReLU are real "0"s and no additional convertor handling is needed for these modules.

Activation (Sigmoid/TanH, etc.)
'''''''''''''''''''''''''''''''

If a complex activation function (e.g. sigmoid or TanH) is used, a LUT has to be used to mimic the curve of that function. The LUT coverage has to match the input convertor parameters precisely for the LUT to act as intended.

Let's use an example to explain this matching process. Suppose [100, 300] is the data range of interest; the user programs the LUT (suppose we have 257 LUT entries) as:

  LUT[0]=f(100), LUT[1]=f(100+200/256), ..., LUT[256]=f(300)

This means that, to get the correct LUT output, the LUT input has to be :math:`x^{'} = (x - O) * SF`, where O=100 and SF=200/256. So software has to carefully program the convertors before the LUT to achieve this.

Precision conversion
''''''''''''''''''''

SDP supports various format conversions. When converting from high precision to low (e.g. INT16->INT8, FP16->INT16/INT8), a convertor is suggested to avoid the data range of interest being rounded or saturated away. The conversion can be done by any of the convertors in the SDP pipeline (except ew_inv_cvt).

CDP convertors
^^^^^^^^^^^^^^

CDP has the convertors listed below:

+--------------+----------------------------------------+--------------------------------------+-------+
| Convertor    | Functionality                          | Limitation                           | Owner |
+==============+========================================+======================================+=======+
| cdp_in_cvt   | Convert the input data to be           | For CDP.INPUT_DATA_TYPE=INT8/INT16   | HW    |
|              | compatible with the LUT requirement;   | only                                 |       |
|              | the output of this convertor should    |                                      |       |
|              | be x*2\ :sup:`M`                       |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+
| cdp_lut_cvt  | Each LUT entry has 16 bits (can be     | No offset allowed                    | SW    |
|              | interpreted as INT16 or FP16 based on  |                                      |       |
|              | the pipeline); the original f(x) has   |                                      |       |
|              | to be converted to the specified       |                                      |       |
|              | format to keep a high precision        |                                      |       |
+--------------+----------------------------------------+--------------------------------------+-------+
| cdp_out_cvt  | Convert the results to INT8/16/FP16    | For CDP.INPUT_DATA_TYPE=INT8/INT16   | HW    |
|              | before output to external              | only                                 |       |
+--------------+----------------------------------------+--------------------------------------+-------+

Suppose the CDP input is encoded as :math:`O_{in}, SF_{in}`. In order to make the LUT input have the form :math:`x*2^M`, cdp_in_cvt has to be programmed as:

.. math:: O_{cdp\_in\_cvt} = -O_{in} * SF_{in}

.. math:: SF_{cdp\_in\_cvt} = \frac{2^M}{SF_{in}}

The value M should be selected by precision study.

Suppose the CDP output is encoded as :math:`O_{out}, SF_{out}`; then cdp_lut_cvt and cdp_out_cvt have to be programmed so that:

.. math:: O_{out} == \frac{O_{cdp\_out\_cvt}}{SF_{cdp\_lut\_cvt} * 2^M}

.. math:: SF_{out} == SF_{cdp\_lut\_cvt} * SF_{cdp\_out\_cvt} * 2^M
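A minimal sketch of the cdp_in_cvt programming step under the two formulas above; names are illustrative, and real code would map these values onto the CDP register fields and their integer encodings:

.. code-block:: c

  #include <math.h>

  /* Sketch: derive cdp_in_cvt parameters so that an input encoded as
   * x_enc = (x - o_in) * sf_in is turned into x * 2^M for the LUT. */
  struct cdp_in_cvt { double offset, scaling; };

  static struct cdp_in_cvt program_cdp_in_cvt(double o_in, double sf_in, int m)
  {
      struct cdp_in_cvt cvt;
      cvt.offset  = -o_in * sf_in;          /* O_cdp_in_cvt  = -O_in * SF_in */
      cvt.scaling = ldexp(1.0, m) / sf_in;  /* SF_cdp_in_cvt = 2^M / SF_in   */
      return cvt;
  }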
PDP convertors
^^^^^^^^^^^^^^

There is no convertor instantiated in PDP. Note, however, that the PDP padding value is intended to compensate for the input offset; for the FP16 pipe it is ignored, as we assume there is no offset for FP16.

.. _convertor-statistics:

Convertor statistics
^^^^^^^^^^^^^^^^^^^^

NVDLA implements counters to evaluate the number of samples that overflow during conversion. Overflow is defined as:

.. math:: INT32: x < -2147483648 || x > 2147483647

.. math:: INT16: x < -32768 || x > 32767

.. math:: INT8: x < -128 || x > 127

.. math:: FP16: fabs(x) >= 65504

Here is a list of the saturation counters in the NVDLA pipeline:

+---------------------------+--------------------------------------+
| Register                  | Valid condition                      |
+===========================+======================================+
| CACC.D_OUT_SATURATION     | Always enabled                       |
+---------------------------+--------------------------------------+
| SDP.D_PERF_OUT_SATURATION | PERF_SAT_EN=YES &&                   |
|                           | PROC_PRECISION==OUT_PRECISION==FP16  |
+---------------------------+--------------------------------------+
| CDP.D_OUT_SATURATION      | Always enabled                       |
+---------------------------+--------------------------------------+
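For reference, the per-sample overflow predicates behind those counters can be written as below; this is only an illustration of the conditions above, the actual counting happens in hardware:

.. code-block:: c

  #include <math.h>
  #include <stdint.h>

  /* Sketch: overflow predicates matching the definitions above; hardware
   * increments the saturation counter when a sample meets the condition
   * for its output precision. */
  static int overflows_int8(int64_t x)  { return x < -128 || x > 127; }
  static int overflows_int16(int64_t x) { return x < -32768 || x > 32767; }
  static int overflows_int32(int64_t x) { return x < INT32_MIN || x > INT32_MAX; }
  static int overflows_fp16(double x)   { return fabs(x) >= 65504.0; }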