In-memory data formats
======================

Overview
--------

The DLA engine supports common CNN layer types such as convolution, pooling, ReLU and LRN. To improve performance, the DLA engine also offers options such as Winograd convolution, weight compression and multi-batch mode. To support these layers, the DLA engine uses dedicated input and output data formats.

There are two main types of input format: weight data and activation data.

The weight formats include the following variants:

- weight for direct convolution
- weight for image input
- weight for Winograd convolution

Two options apply to weight formats:

- channel post-extension for image input mode
- sparse compression

The activation formats supported by the DLA engine include:

- feature data format
- pixel format (ROI input)

The output formats supported by the DLA engine include:

- feature data format

In addition, the DLA engine fetches the following auxiliary formats from external memory:

- bias data
- PReLU data
- batch-normalization data
- element-wise data

Channel extension refers to a set of mapping rules applied to both weight data and activation data to fit the accelerator. It includes:

- channel extension for Winograd convolution
- channel pre-extension for image input mode
- channel post-extension for image input mode

The channel post-extension for image input mode is a performance option; the other two are mandatory for their respective working modes. Hardware handles all channel extension on feature/pixel data, while software shall apply the corresponding channel extension to weight data.

All data formats shall be mapped into memory according to the rules below.

Precision Type
~~~~~~~~~~~~~~

The NVDLA engine pipeline supports three data precisions: int8, int16 and fp16. For int8, one element of data is an 8-bit signed integer. For int16, one element is a 16-bit signed integer. For fp16, one element is a 16-bit floating-point value, also known as half-precision floating-point format.

All input feature data shall use one of the three precision types, and all image input data is converted to one precision type before calculation. For example, the DLA engine can take a T_R10G10B10A2 image as input (for the first layer) and convert the components to int8, int16 or fp16.

Precision Conversion
^^^^^^^^^^^^^^^^^^^^

The NVDLA engine supports dynamic precision conversion, subject to the following rules:

- The NVDLA convolution pipeline supports precision conversion for image input mode only.
- Direct convolution (DC) mode and Winograd convolution mode do not support precision conversion.
- For image input mode (please see section 6.1.1.4), the pipeline allows conversion from integer formats to all 3 types. Floating-point images can only be converted to fp16.
- Batch-normalization and element-wise layers (implemented in SDP) support free conversion of int16 <-> fp16 and int8 <-> int16 for DC mode only.
- The LRN layer (implemented in CDP) does not support any precision conversion.
- The pooling layer (implemented in PDP) does not support any precision conversion.

Here is the summary:
.. table:: Precision conversion for convolutional layer
   :name: tab_precision_conversion_conv

   +----------------------+------------------+------------------+------------------+
   | Configured           | Configured       | Real precision   | Corresponding    |
   | input format         | output precision | in pipeline      | weight precision |
   +======================+==================+==================+==================+
   | image input          | int8             | int8             | int8             |
   | (uint8/int16/uint16) |                  |                  |                  |
   +----------------------+------------------+------------------+------------------+
   |                      | int16            | int16            | int16            |
   +----------------------+------------------+------------------+------------------+
   |                      | fp16             | fp16             | fp16             |
   +----------------------+------------------+------------------+------------------+
   | image input          | int8             | **Invalid case** | **Invalid case** |
   | (fp16)               |                  |                  |                  |
   +----------------------+------------------+------------------+------------------+
   |                      | int16            | **Invalid case** | **Invalid case** |
   +----------------------+------------------+------------------+------------------+
   |                      | fp16             | fp16             | fp16             |
   +----------------------+------------------+------------------+------------------+
   | int8 feature data    | int8             | int8             | int8             |
   +----------------------+------------------+------------------+------------------+
   |                      | int16            | **Invalid case** | **Invalid case** |
   +----------------------+------------------+------------------+------------------+
   |                      | fp16             | **Invalid case** | **Invalid case** |
   +----------------------+------------------+------------------+------------------+
   | int16 feature data   | int8             | **Invalid case** | **Invalid case** |
   +----------------------+------------------+------------------+------------------+
   |                      | int16            | int16            | int16            |
   +----------------------+------------------+------------------+------------------+
   |                      | fp16             | **Invalid case** | **Invalid case** |
   +----------------------+------------------+------------------+------------------+
   | fp16 feature data    | int8             | **Invalid case** | **Invalid case** |
   +----------------------+------------------+------------------+------------------+
   |                      | int16            | **Invalid case** | **Invalid case** |
   +----------------------+------------------+------------------+------------------+
   |                      | fp16             | fp16             | fp16             |
   +----------------------+------------------+------------------+------------------+
.. table:: Precision conversion for SDP layer (offline mode)
   :name: tab_precision_conversion_sdp

   +--------------------+-----------------------------+----------------------------+
   | Configured         | Configured output precision | Real precision in pipeline |
   | input format       |                             |                            |
   +====================+=============================+============================+
   | int8 feature data  | int8                        | int32                      |
   +--------------------+-----------------------------+----------------------------+
   |                    | int16                       | int32                      |
   +--------------------+-----------------------------+----------------------------+
   |                    | fp16                        | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   | int16 feature data | int8                        | int32                      |
   +--------------------+-----------------------------+----------------------------+
   |                    | int16                       | int32                      |
   +--------------------+-----------------------------+----------------------------+
   |                    | fp16                        | int32                      |
   +--------------------+-----------------------------+----------------------------+
   | fp16 feature data  | int8                        | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   |                    | int16                       | fp32                       |
   +--------------------+-----------------------------+----------------------------+
   |                    | fp16                        | fp32                       |
   +--------------------+-----------------------------+----------------------------+

.. table:: Precision conversion for LRN layer
   :name: tab_precision_conversion_lrn

   +--------------------+-----------------------------+----------------------------+
   | Configured         | Configured output precision | Real precision in pipeline |
   | input format       |                             |                            |
   +====================+=============================+============================+
   | int8 feature data  | int8                        | int8                       |
   +--------------------+-----------------------------+----------------------------+
   |                    | int16                       | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   |                    | fp16                        | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   | int16 feature data | int8                        | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   |                    | int16                       | int16                      |
   +--------------------+-----------------------------+----------------------------+
   |                    | fp16                        | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   | fp16 feature data  | int8                        | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   |                    | int16                       | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   |                    | fp16                        | fp16                       |
   +--------------------+-----------------------------+----------------------------+
.. table:: Precision conversion for pooling layer
   :name: tab_precision_conversion_poolong

   +--------------------+-----------------------------+----------------------------+
   | Configured         | Configured output precision | Real precision in pipeline |
   | input format       |                             |                            |
   +====================+=============================+============================+
   | int8 feature data  | int8                        | int8                       |
   +--------------------+-----------------------------+----------------------------+
   |                    | int16                       | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   |                    | fp16                        | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   | int16 feature data | int8                        | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   |                    | int16                       | int16                      |
   +--------------------+-----------------------------+----------------------------+
   |                    | fp16                        | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   | fp16 feature data  | int8                        | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   |                    | int16                       | **Invalid case**           |
   +--------------------+-----------------------------+----------------------------+
   |                    | fp16                        | fp16                       |
   +--------------------+-----------------------------+----------------------------+

For pixel formats, the conversion to int8/int16/fp16 follows the equations below.

.. math:: d_{int8} = truncate2int8\left( \left( d_{\text{pixel}} - offset \right)*SF \right)

.. math:: d_{int16} = truncate2int16\left( \left( d_{\text{pixel}} - offset \right)*SF \right)

.. math:: d_{fp16} = int2fp\left( \left( d_{\text{pixel}} - offset \right)*SF \right)

Equation 1 pixel precision conversion

Here *SF* refers to the scaling factor and *offset* refers to the offset value. Both are given by programmable register fields.

For conversion between int16 and int8, the equations are:

.. math:: d_{int8} = truncate2int8\left( \left( d_{int16} - offset \right)*SF \right)

.. math:: d_{int16} = truncate2int16\left( \left( d_{int8} - offset \right)*SF \right)

Equation 3 precision conversion between int8 and int16

**The CDMA and SDP convert precision individually.** When working in on-the-fly mode, SDP takes the precision of the convolution pipeline output as its input precision and then performs another precision conversion; the input precision and output precision must have the same bit depth.
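As a concreteness check, the short sketch below models the conversion equations above in plain Python. The function and parameter names are illustrative only (they are not NVDLA register or API names), and fp16 rounding is not modeled.

.. code-block:: python

   def truncate_to_int(value, bits):
       """Saturate to a signed two's-complement range of the given width."""
       lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
       return max(lo, min(hi, int(value)))

   def convert(d_in, offset, sf, out_precision):
       """Apply d_out = truncate2intN((d_in - offset) * SF)."""
       scaled = (d_in - offset) * sf
       if out_precision == "int8":
           return truncate_to_int(scaled, 8)
       if out_precision == "int16":
           return truncate_to_int(scaled, 16)
       if out_precision == "fp16":
           return float(scaled)          # int2fp; fp16 rounding not modeled
       raise ValueError(out_precision)

   # Example: a 10-bit pixel component (0..1023) mapped to int8.
   assert convert(1023, offset=512, sf=0.25, out_precision="int8") == 127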
FP16 Support
^^^^^^^^^^^^

This section describes how NVDLA supports fp16 in the data path.

- Infinity

NVDLA converts infinity values to different normalized values, module by module:

+----------------------+-------------------------+
| Sub-module           | INF converted values    |
+======================+=========================+
| Convolution pipeline | +/-65536 (DC/IMG)       |
|                      |                         |
|                      | +/-65504 (Winograd)     |
+----------------------+-------------------------+
| SDP                  | +/-3.40282e+38          |
+----------------------+-------------------------+
| CDP                  | +/-4292870144           |
+----------------------+-------------------------+
| PDP                  | +/-4292870144 (for AVE) |
|                      |                         |
|                      | INF (for Max/Min)       |
+----------------------+-------------------------+

No NVDLA sub-module ever outputs INF. If saturation happens, NVDLA outputs the maximum representable value (+/-65504 for FP16, 32767/-32768 for INT16, 127/-128 for INT8).

- NaN

NVDLA never generates NaN, since no infinity value is involved in any operation, but it does support NaN propagation. If the input data contain NaN, any result related to a NaN operand will be NaN (the mantissa propagation behavior is undefined).

NVDLA provides a register field to flush NaN to zero. If the register is set, all input NaNs are treated as zero in the floating-point data path and the output data cube does not contain any NaN. Otherwise, input NaNs propagate to the output.

NVDLA also provides input/output NaN counting registers that report the total NaN count in the input/output data cube. The counting registers are updated when a layer is done. When the done interrupt arrives, firmware can poll the NaN counting registers to determine whether the input/output data cubes contain any NaN value.

- Denormalized value

NVDLA supports denormalized values for both input and output, handled fully according to the IEEE 754 standard. Internally, the NVDLA floating-point data path often uses fp17/fp32 values for better precision. These fp17 and fp32 formats do not support denormalized values during calculation, even though they provide better precision than fp16 with denormalized values. Before write-back to memory, fp17/fp32 values are converted to fp16 with denormalized values.

- Rounding

NVDLA uses round-to-nearest (RN) in calculations, except in the overflow case: if a result exceeds the maximum normal value, it is clipped to the maximum normal value.

.. _feature_data_format:

Feature Data Format
-------------------

The DLA engine maintains a private data format, called the feature data format, for all supported HW-layers. This format is generated only by the DLA engine itself.

All elements of feature data for one layer are organized as a 3D data cube with three dimensions: width (W), height (H) and channel size (C). The memory mapping rules are (a sketch of the resulting address computation appears at the end of this section):

- Append data at the end of the channel dimension if the original data is not 32-byte aligned in the C direction. The appended data can be any value except NaN when the precision is fp16.
- Split the data cube into 1x1x32-byte small atom cubes.
- Reorder the atom cubes by progressively scanning the data cube. The scanning order is W (line) -> H (height) -> C (channel).
- Map all atom cubes into memory in scanning sequence.
- All atom cubes in the same line are mapped compactly.
- Atom cubes at line and/or surface boundaries may be mapped either adjacently or with gaps, but they are always 32-byte aligned.
- In conclusion, the mapping in memory follows a pitch-linear format. The order is C' (32 bytes) -> W -> H -> C (surfaces), where C' changes fastest and C changes slowest.

:numref:`fig_packed_feature_diagram` shows a case of feature data where all small cubes are mapped compactly. This is called packed feature data. If the lines or surfaces of small cubes are not mapped compactly, the data is called unpacked. See :numref:`fig_unpacked_feature_diagram`.

.. note:: Line stride and surface stride of feature data shall always align to 32 bytes, and the start address has the same alignment. This is a mandatory requirement.

.. image1

.. _fig_packed_feature_diagram:

.. figure:: format_packed_feature_diagram.svg
   :align: center

   Packed feature data

.. image2

.. _fig_unpacked_feature_diagram:

.. figure:: format_unpacked_feature_diagram.svg
   :align: center

   Unpacked feature data

If a 1x1xC feature data cube is mapped surface-packed, NVDLA can treat it as a (C/32)x1x32 cube to save bandwidth.

Mapping of the feature data cube is done by NVDLA core logic; the Falcon processor is not involved in the mapping procedure.
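To make the pitch-linear mapping concrete, here is a minimal sketch, assuming a cube described by its base address, line stride and surface stride (all 32-byte aligned). The function name is illustrative; this is a software model, not NVDLA driver code.

.. code-block:: python

   ATOM_BYTES = 32

   def feature_element_address(base, w, h, c,
                               bytes_per_element,      # 1: int8, 2: int16/fp16
                               line_stride, surface_stride):
       """Byte address of element (w, h, c) in a feature data cube."""
       elems_per_atom = ATOM_BYTES // bytes_per_element    # 32 or 16
       surface = c // elems_per_atom     # C (surfaces), changes slowest
       within = c % elems_per_atom       # C' inside the atom, changes fastest
       return (base
               + surface * surface_stride                  # C (surfaces)
               + h * line_stride                           # H
               + w * ATOM_BYTES                            # W, one atom per (w, h)
               + within * bytes_per_element)               # C'

   # Example: packed int16 cube, width 8, height 4.
   line_stride = 8 * ATOM_BYTES              # 256 bytes
   surface_stride = 4 * line_stride          # 1024 bytes
   print(hex(feature_element_address(0x1000, w=2, h=1, c=17,
                                     bytes_per_element=2,
                                     line_stride=line_stride,
                                     surface_stride=surface_stride)))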
Pixel Format
------------

The DLA engine supports pixel data for ROI. The pixel data comes from a part of an image or from a whole image. The supported pixel formats are listed in :numref:`tab_pixel_formats`.

When NVDLA takes an image as input data, there are some configuration limits:

- Channel size. The valid channel size depends on the format. Please see :numref:`tab_pixel_formats`.
- Input precision. The valid input precision depends on the pixel format. Please see :numref:`tab_pixel_formats`. DMA logic converts unsigned integer values to signed integer values automatically.
- **Both the start address and the line stride of the pitch-linear surface shall be aligned to 32 bytes. This is a mandatory requirement.**
- There may be redundant data between the 32-byte aligned address and the first element. NVDLA uses an X offset to indicate how much redundant data there is. The unit of the offset is pixels (see the sketch after the table below).

.. table:: Pixel formats and valid settings
   :name: tab_pixel_formats

   +---------------------+-------------+---------------+-------------+--------------+
   | Format Name         | # of planar | Valid channel | Valid input | Valid X      |
   |                     |             | size setting  | precision   | offset range |
   |                     |             |               | setting     |              |
   +=====================+=============+===============+=============+==============+
   | T_R8                | 1           | 1             | int8        | 0~31         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_R10               | 1           | 1             | int16       | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_R12               | 1           | 1             | int16       | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_R16               | 1           | 1             | int16       | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_R16_I             | 1           | 1             | int16       | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_R16_F             | 1           | 1             | fp16        | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_A16B16G16R16      | 1           | 4             | int16       | 0~3          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_X16B16G16R16      | 1           | 4             | int16       | 0~3          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_A16B16G16R16_F    | 1           | 4             | fp16        | 0~3          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_A16Y16U16V16      | 1           | 4             | int16       | 0~3          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_V16U16Y16A16      | 1           | 4             | int16       | 0~3          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_A16Y16U16V16_F    | 1           | 4             | fp16        | 0~3          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_A8B8G8R8          | 1           | 4             | int8        | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_A8R8G8B8          | 1           | 4             | int8        | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_B8G8R8A8          | 1           | 4             | int8        | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_R8G8B8A8          | 1           | 4             | int8        | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_X8B8G8R8          | 1           | 4             | int8        | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_X8R8G8B8          | 1           | 4             | int8        | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_B8G8R8X8          | 1           | 4             | int8        | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_R8G8B8X8          | 1           | 4             | int8        | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_A2B10G10R10       | 1           | 4             | int16       | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_A2R10G10B10       | 1           | 4             | int16       | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_B10G10R10A2       | 1           | 4             | int16       | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_R10G10B10A2       | 1           | 4             | int16       | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_A2Y10U10V10       | 1           | 4             | int16       | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_V10U10Y10A2       | 1           | 4             | int16       | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_A8Y8U8V8          | 1           | 4             | int8        | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_V8U8Y8A8          | 1           | 4             | int8        | 0~7          |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_Y8___U8V8_N444    | 2           | 3             | int8        | 0~31         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_Y8___V8U8_N444    | 2           | 3             | int8        | 0~31         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_Y10___U10V10_N444 | 2           | 3             | int16       | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_Y10___V10U10_N444 | 2           | 3             | int16       | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_Y12___U12V12_N444 | 2           | 3             | int16       | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_Y12___V12U12_N444 | 2           | 3             | int16       | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_Y16___U16V16_N444 | 2           | 3             | int16       | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
   | T_Y16___V16U16_N444 | 2           | 3             | int16       | 0~15         |
   +---------------------+-------------+---------------+-------------+--------------+
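The following sketch shows how the 32-byte alignment rule and the X offset interact. Given the byte address of the ROI's first pixel and the pixel size of the chosen format, it derives the aligned start address and the X offset (in pixels) that software would program. The names are illustrative, not NVDLA register names.

.. code-block:: python

   def roi_start_and_x_offset(first_pixel_addr, bytes_per_pixel):
       aligned_start = first_pixel_addr & ~31         # round down to 32 bytes
       redundant_bytes = first_pixel_addr - aligned_start
       assert redundant_bytes % bytes_per_pixel == 0
       x_offset = redundant_bytes // bytes_per_pixel  # unit: pixels
       return aligned_start, x_offset

   # Example: T_A8B8G8R8 (4 bytes per pixel), ROI beginning 3 pixels past
   # a 32-byte boundary -> X offset 3 (valid range 0~7 for this format).
   print(roi_start_and_x_offset(0x2000 + 3 * 4, bytes_per_pixel=4))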
.. _weight_format:

Weight Format
-------------

Unlike pixel data or feature data, weight data is generated long before the convolution operation, and the DLA engine never changes it during operation. Software shall map weight data with the proper rules to fit the calculation sequence in DLA.

The original weight data has 4 dimensions: width, height, channel and number of kernels. It can be constructed as a group of 3D data cubes, where one data cube is called a kernel. See :numref:`fig_original_weight_data`.

The DLA engine supports 4 basic formats of weight data, one for each operation mode:

- weight for direct convolution
- weight for image input
- weight for deconvolution
- weight for Winograd convolution

Some formats have mandatory requirements:

- channel pre-extension for image input
- channel extension for Winograd
- set split for deconvolution

And there are two options for weight formats to improve DLA performance:

- channel post-extension
- sparse compression

.. table:: Weight formats and options
   :name: tab_weight_formats

   +--------------------------+---------------------------+-----------------------+
   | Weight types             | Sparse compression option | Post-extension option |
   +==========================+===========================+=======================+
   | Weight for DC            | Support                   | **NOT support**       |
   +--------------------------+---------------------------+-----------------------+
   | Weight for Winograd      | Support                   | **NOT support**       |
   +--------------------------+---------------------------+-----------------------+
   | Weight for image input   | Support                   | Support               |
   +--------------------------+---------------------------+-----------------------+
   | Weight for deconvolution | Support                   | **NOT support**       |
   +--------------------------+---------------------------+-----------------------+

.. image3

.. _fig_original_weight_data:

.. figure:: format_original_weight_data.svg
   :scale: 55%
   :align: center

   Original weight data

Basic Weight for Direct Convolution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Basic weight for direct convolution is the most basic and common weight format; all other weight formats are extensions of it. The mapping rules of uncompressed weight for direct convolution are (a software sketch follows at the end of this subsection):

- Distribute the kernels into groups. For int16 and fp16 weight, one group has 16 kernels; for int8, one group has 32 kernels. The last group may have fewer kernels.
- Divide each kernel into 1x1x64-element small cubes. For int16/fp16 each small cube is 128 bytes; for int8 each small cube is 64 bytes. Do not append 0s if the channel size is not divisible by 64 elements (128/64 bytes).
- After division, all weights are stored as 1x1xC' small cubes, where C' is no more than 128 bytes.
- Scan the 1x1xC' small cubes in a group in C'->K->W->H->C sequence, where C' changes fastest and C changes slowest, and map them compactly in scanning sequence.
- Map the weight groups compactly. Do not append any 0s at group boundaries.
- Append 0s at the end of all mapped weights for 128-byte alignment.

The diagram below shows how a group of 3x3x192-byte kernels maps for direct convolution.

.. image4

.. _fig_dc_weight_mapping:

.. figure:: format_dc_weight_mapping.svg
   :scale: 55%
   :align: center

   Weight mapping for direct convolution inside one group
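The sketch below reorders a weight array following the rules above. It is a software model under stated assumptions: weights are indexed as weights[k][h][w][c], elements are stored little-endian, and the int16/fp16 case uses 2 bytes per element; it is not NVDLA driver code.

.. code-block:: python

   def map_dc_weights(weights, K, H, W, C, bytes_per_element=2):
       """Flatten weights[k][h][w][c] into the DC byte layout (uncompressed)."""
       kernels_per_group = 16 if bytes_per_element == 2 else 32
       out = bytearray()
       for g0 in range(0, K, kernels_per_group):   # groups, mapped compactly
           group = range(g0, min(g0 + kernels_per_group, K))
           for c0 in range(0, C, 64):              # C in 64-element chunks (slowest)
               for h in range(H):                  # H
                   for w in range(W):              # W
                       for k in group:             # K
                           for c in range(c0, min(c0 + 64, C)):  # C' (fastest)
                               out += int(weights[k][h][w][c]).to_bytes(
                                   bytes_per_element, "little", signed=True)
       out += b"\x00" * (-len(out) % 128)          # pad only the end to 128 bytes
       return bytes(out)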
Basic Weight for Image Input
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Weight mapping for image input is similar to weight for direct convolution. The main difference is that image weight needs an additional channel extension step ahead of the mapping steps for direct convolution weight. The channel pre-extension for image weight is a mandatory requirement, while channel post-extension is an option to improve performance.

.. note:: Channel pre-extension for image weight is different from channel extension for Winograd convolution.

The key idea of pre-extension is to turn all weights in the same line into a single channel. :numref:`fig_dc_channel_extension_for_image_for_weight` shows a case for an int16 image input whose channel size is 3.

.. image5

.. _fig_dc_channel_extension_for_image_for_weight:

.. figure:: format_dc_channel_extension_for_image_for_weight.svg
   :scale: 55%
   :align: center

   Channel extension for image weight

Channel pre-extension is the first step for image weight. All extended kernels then follow the same steps as weight for direct convolution; that is, software still needs to do group and channel distribution after channel extension.

Basic Weight for Winograd Convolution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The memory mapping of Winograd weight is very different from direct convolution. There are two phases to process the weights. Phase 1 does channel extension and conversion for each kernel. Phase 2 groups the kernels and maps the small cubes into memory.

Steps of phase 1:

- Divide the kernels into 1x1x32-byte small cubes. If the channel size is not divisible by 32, append 0s.
- Do channel extension if the convolution stride is not 1. The new width and height of a kernel should be 3 after extension.
- Convert each kernel from a 3x3xC cube to a 4x4xC cube. The conversion is :math:`GwG^{T}`, where *w* is each 3x3x1 slice of the weight cube, *G* is a 4x3 matrix and :math:`G^{T}` is its transpose.
- A scaling factor may be involved during conversion. Please see the Winograd convolution documentation for reference.
- The width and height of a kernel should be 4 after conversion.

.. math::

   G = \begin{bmatrix}
   1 & 0 & 0 \\
   0.5 & 0.5 & 0.5 \\
   0.5 & - 0.5 & 0.5 \\
   0 & 0 & 1 \\
   \end{bmatrix}

Matrix for weight transfer for Winograd

Steps of phase 2:

- Distribute the converted kernels into groups. For int16 and fp16 weight, one group has 16 kernels; for int8, one group has 32 kernels.
- Divide the converted kernels into small cubes of 4x4x4 elements. For int16/fp16 each small cube is 128 bytes; for int8 each small cube is 64 bytes. The channel size should always be divisible by 4.
- Scan the 4x4x4-element small cubes in a group in K->C sequence. Taking int16 as an example, the scan order is small cube 0 of K0, small cube 0 of K1, small cube 0 of K2, ..., small cube 0 of K15, small cube 1 of K0, small cube 1 of K1, ..., small cube 1 of K15, ..., small cube N of K15.
- Map the 4x4x4-element small cubes compactly in scanning order.
- Map the weight groups compactly, one by one.

Phase 2 is similar to weight for direct convolution except that the small cube size is 4x4x4 elements. The figure below shows how channel extension is applied to one kernel and how the data is mapped; a numerical sketch of the phase-1 transform follows the figure.

.. image6

.. _fig_channel_extension_and_conversion_for_wingorad:

.. figure:: format_channel_extension_and_conversion_for_wingorad.svg
   :align: center

   Channel extension and conversion for Winograd
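As a numerical check of the phase-1 transform, the sketch below applies G * w * G^T to one 3x3 slice in plain Python (no scaling factor applied; illustrative code only).

.. code-block:: python

   G = [[1.0, 0.0, 0.0],
        [0.5, 0.5, 0.5],
        [0.5, -0.5, 0.5],
        [0.0, 0.0, 1.0]]

   def matmul(a, b):
       return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
                for j in range(len(b[0]))] for i in range(len(a))]

   def winograd_weight_transform(w3x3):
       """Turn one 3x3x1 weight slice into its 4x4 transformed slice."""
       g_t = [list(col) for col in zip(*G)]     # G^T, 3x4
       return matmul(matmul(G, w3x3), g_t)      # (4x3)(3x3)(3x4) -> 4x4

   # Example: transform an all-ones 3x3 slice.
   for row in winograd_weight_transform([[1, 1, 1]] * 3):
       print(row)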
Weight Channel Post-extension for Image Input
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Weight channel post-extension is an option to enhance MAC efficiency when the channel size is less than 32. It is available for image input mode only.

The key idea of channel post-extension is to combine neighboring lines to improve efficiency. It allows two-line (C<=32) or four-line (C<=16) combination; the available parameter values are 1, 2 and 4. If this option is enabled, NVDLA post-extends the input feature (or image) data in the CSC sub-unit, and software needs to adjust the weight mapping order accordingly.

Channel post-extension is done after pre-extension. The figure below shows a case where the parameter is 2.

.. image7

.. _fig_weight_channel_post_extension_2:

.. figure:: format_weight_channel_post_extension_2.svg
   :align: center

   Weight channel post-extension, parameter = 2

The flow of the pre-extension, post-extension, mapping and compression steps for image weight is:

- Do pre-extension.
- Do post-extension.
- Remap the weight data.
- Do weight compression.

Some tips for post-extension:

- Channel post-extension cannot be used with Winograd convolution.
- Channel post-extension only supports 2-line and 4-line combination.
- If the weight height is not divisible by 2 (2-line) or 4 (4-line), do NOT append 0s. This is unlike channel extension for Winograd.

Sparse Compression Option
~~~~~~~~~~~~~~~~~~~~~~~~~

To reduce bandwidth and power consumption on the memory interface, the NVDLA engine supports a weight sparse compression option. All four weight formats support sparse compression. This option requires additional steps after basic mapping and the post-extension option.

The sparse algorithm uses a one-bit tag to indicate whether a weight element is zero. The bit tags of one kernel group compose a weight mask bit group, or WMB. WMBs reside in a dedicated memory surface. Since zero values are marked by bit tags (the corresponding bit is set to 0), they can be removed from the original weight memory surface. A third memory surface records the remaining byte count of each kernel group (WGS).

The steps of weight compression are (see the sketch after the figure below):

- Always use 1 bit to indicate 1 element of weight. For int16 and fp16, 1 bit represents 2 bytes of weight data; for int8, 1 bit represents 1 byte of weight data.
- Compress the weights group by group. The assembly of bits for one weight group is called a WMB. The bits in a WMB are stored little-endian.
- Align the WMB surface to 128 bytes by appending 0 bits at the end.
- Remove all 0 weights from the original surface and pack the remaining weights compactly.
- Align the compressed weight surface to 128 bytes by appending 0s at the end.
- Calculate the byte count of each compressed group. The remaining byte count of each group is called the weight group size, or WGS. Each WGS entry is 32 bits wide.
- Store the WGS, WMB and compressed weight in three separate memory surfaces.

The diagram below shows the memory mapping of the compressed weight format.

.. image8

.. _fig_memory_mapping_of_compressed_weight:

.. figure:: format_memory_mapping_of_compressed_weight.svg
   :align: center

   Memory mapping of compressed weight
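Here is a sketch of the compression steps for one weight group, under the assumptions that tags are packed LSB-first into WMB bytes (matching the little-endian note above) and that weights are stored little-endian; it is a software model, not NVDLA code.

.. code-block:: python

   def compress_group(elements, bytes_per_element):
       """Return (packed non-zero weights, WMB bytes, WGS byte count)."""
       wmb_bits = []
       packed = bytearray()
       for e in elements:
           if e == 0:
               wmb_bits.append(0)        # zero weight: tag 0, element removed
           else:
               wmb_bits.append(1)        # non-zero: tag 1, element kept
               packed += int(e).to_bytes(bytes_per_element, "little", signed=True)
       wmb = bytearray()
       for i in range(0, len(wmb_bits), 8):   # pack tags LSB-first per byte
           byte = 0
           for j, bit in enumerate(wmb_bits[i:i + 8]):
               byte |= bit << j
           wmb.append(byte)
       wgs = len(packed)                 # held as a 32-bit value per group
       return bytes(packed), bytes(wmb), wgs

   def pad_surface(buf):
       """Align a WMB or compressed-weight surface to 128 bytes."""
       return bytes(buf) + b"\x00" * (-len(buf) % 128)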
Bias Data Format
----------------

Bias data is another optional input for convolution layers. When this option is enabled, the DLA engine adds the bias data to the convolution result before writing back to memory.

There are three types of bias data:

- per-layer bias data
- per-channel bias data
- per-element bias data

Per-channel and per-element bias data are stored in memory for the DLA engine to fetch; per-layer bias data comes from a register. If the output feature data cube is WxHxC, the corresponding bias cube sizes are:

+-------------+------------------+
| Per Layer   | 1x1x1 (register) |
+-------------+------------------+
| Per Channel | 1x1xC            |
+-------------+------------------+
| Per Element | WxHxC            |
+-------------+------------------+

For the INT pipeline, bias data can be either INT8 or INT16; for FP16, bias data is in 16-bit fp16 format. Bias data is generated along with the CNN network. The memory mapping of bias data is described below:

**Per Channel:**

- Two bytes per element with INT16/FP16 or 1 byte per element with INT8:

.. image9

.. _fig_memory_mapping_of_per_channel_bias_data_case1:

.. figure:: format_memory_mapping_of_per_channel_bias_data_case1.svg
   :align: center

   Memory Mapping of Per Channel Bias Data (Case 1)

- 2 bytes per element with INT8:

.. image10

.. _fig_memory_mapping_of_per_channel_bias_data_case2:

.. figure:: format_memory_mapping_of_per_channel_bias_data_case2.svg
   :align: center

   Memory Mapping of Per Channel Bias Data (Case 2)

**Per Element:**

- Two bytes per element with INT16/FP16 or 1 byte per element with INT8:

.. image11

.. _fig_memory_mapping_of_per_element_bias_data_case1:

.. figure:: format_memory_mapping_of_per_element_bias_data_case1.svg
   :align: center

   Memory Mapping of Per Element Bias Data (Case 1)

- 2 bytes per element with INT8:

.. image12

.. _fig_memory_mapping_of_per_element_bias_data_case2:

.. figure:: format_memory_mapping_of_per_element_bias_data_case2.svg
   :align: center

   Memory Mapping of Per Element Bias Data (Case 2)

PReLU Data Format
-----------------

Each PReLU data element has just one component, which is fed into the multiplier of SDP. PReLU always operates per channel, so there is only one type of PReLU data:

- per-channel PReLU data

Per-channel PReLU data is stored in memory in a contiguous 1x1xC space. Note that C is in units of channels.

- For INT8/INT16, each channel occupies 1 or 2 bytes, depending on the B/N/E RDMA_DATA_SIZE setting.
- For FP16, each channel needs 2 bytes of data.

The memory mapping of PReLU data is described below:

- Two bytes per element with INT16/FP16 or 1 byte per element with INT8:

.. image13

.. _fig_memory_mapping_of_per_channel_prelu_data_case1:

.. figure:: format_memory_mapping_of_per_channel_prelu_data_case1.svg
   :align: center

   Memory Mapping of Per Channel PReLU Data (Case 1)

- 2 bytes per element with INT8:

.. image14

.. _fig_memory_mapping_of_per_channel_prelu_data_case2:

.. figure:: format_memory_mapping_of_per_channel_prelu_data_case2.svg
   :align: center

   Memory Mapping of Per Channel PReLU Data (Case 2)

Batch Normalization Data Format
-------------------------------

Batch normalization data is another optional input, for batch normalization layers. Each normalization data element consists of two parts: one is added to the feature data, and the other is multiplied with the result after the addition.

There are two types of batch normalization data:

- per-channel batch normalization data
- per-layer batch normalization data

Per-channel batch normalization data is stored in memory in a contiguous 1x1xC space. Note that C is in units of channels.

- For INT8/INT16, each of the two parts of the normalization data can be either 1 byte or 2 bytes, so each channel needs 2*1 or 2*2 bytes of data.
- For FP16, each of the two parts is 2 bytes, so each channel needs 4 bytes of data.

The pair of values for each element is always packed together in memory (see the sketch at the end of this section). The memory mapping is described below:

- Two bytes per element with INT16/FP16 or 1 byte per element with INT8:

.. image15

.. _fig_memory_mapping_of_batch_normalization_data_case1:

.. figure:: format_memory_mapping_of_batch_normalization_data_case1.svg
   :align: center

   Memory Mapping of Batch Normalization Data (Case 1)

- 2 bytes per element with INT8:

.. image16

.. _fig_memory_mapping_of_batch_normalization_data_case2:

.. figure:: format_memory_mapping_of_batch_normalization_data_case2.svg
   :align: center

   Memory Mapping of Batch Normalization Data (Case 2)

Per-layer batch normalization data is stored in a register. Note that INT8 and INT16 here refer to the processing precision: when the layer performs an INT16-to-INT8 or INT8-to-INT16 precision conversion, the batch normalization data must be set to the processing precision, which is always INT8.
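The sketch below illustrates the per-channel pair packing described above. The order of the two components within a pair (addend first, multiplier second) is an assumption for illustration; the function is not NVDLA software.

.. code-block:: python

   def pack_batch_norm(add_part, mul_part, bytes_per_component):
       """Pack per-channel (addend, multiplier) pairs into a 1x1xC blob."""
       assert len(add_part) == len(mul_part)
       out = bytearray()
       for a, m in zip(add_part, mul_part):   # the pair is packed per channel
           out += int(a).to_bytes(bytes_per_component, "little", signed=True)
           out += int(m).to_bytes(bytes_per_component, "little", signed=True)
       return bytes(out)

   # Example: INT16 data, C = 3 -> 3 channels * 2 components * 2 bytes = 12.
   assert len(pack_batch_norm([10, -5, 0], [256, 128, 64], 2)) == 12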
Element-Wise Data Format
------------------------

Element-wise data is another optional input, for element-wise layers. Each element-wise data element consists of just one part, used by either the ALU or the multiplier. There is one type of element-wise data:

- per-element element-wise data

Per-element element-wise data is stored in memory with a size of W x H x C.

- For INT8/INT16, each element can be either 1 byte or 2 bytes.
- For FP16, each element is 2 bytes.

From the algorithm perspective, an element-wise operation employs the ALU or the MUL, but never both. The DLA hardware, however, supports employing both operations per element; in that case, each element size is twice the size described above.

The memory mapping is described below:

- Two bytes per element with INT16/FP16 or 1 byte per element with INT8:

.. image17

.. _fig_memory_mapping_of_element_wise_data_case1:

.. figure:: format_memory_mapping_of_element_wise_data_case1.svg
   :align: center

   Memory Mapping of Element-Wise Data (Case 1)

- 2 bytes per element with INT8:

.. image18

.. _fig_memory_mapping_of_element_wise_data_case2:

.. figure:: format_memory_mapping_of_element_wise_data_case2.svg
   :align: center

   Memory Mapping of Element-Wise Data (Case 2)

Note that INT8 and INT16 here refer to the processing precision: when the layer performs an INT16-to-INT8 or INT8-to-INT16 precision conversion, the element-wise data must be set to the processing precision, which is always INT8.

Normally, one atom contains 1x1x32 bytes of data, but this no longer holds for:

- bias data format
- PReLU data format
- batch normalization data format
- element-wise data format

The bytes per atom for those formats is computed as (see the sketch after the tables below):

BytesPerAtom = ElementPerAtom \* ComponentsPerElement \* BytesPerComponent

ElementPerAtom is decided by the PROC_PRECISION of the SDP data pipeline:

+----------------+----------------+
| PROC_PRECISION | ElementPerAtom |
+================+================+
| INT8           | 32             |
+----------------+----------------+
| INT16/FP16     | 16             |
+----------------+----------------+

ComponentsPerElement is decided by the use case (the DATA_USE register):

+-----------------------------------------+----------------------+
| Use case                                | ComponentsPerElement |
+=========================================+======================+
| Bias                                    | 1                    |
+-----------------------------------------+----------------------+
| PReLU                                   | 1                    |
+-----------------------------------------+----------------------+
| BatchNormalization                      | 2                    |
+-----------------------------------------+----------------------+
| Element-wise (only ALU or MUL enabled)  | 1                    |
+-----------------------------------------+----------------------+
| Element-wise (both ALU/MUL are enabled) | 2                    |
+-----------------------------------------+----------------------+

BytesPerComponent is decided by the precision (the DATA_SIZE register):

+-----------+-------------------+
| DATA_SIZE | BytesPerComponent |
+===========+===================+
| ONE_BYTE  | 1                 |
+-----------+-------------------+
| TWO_BYTE  | 2                 |
+-----------+-------------------+
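The computation above transcribes directly into code; the dictionary keys mirror the PROC_PRECISION, DATA_USE and DATA_SIZE settings from the tables, while the function itself is illustrative.

.. code-block:: python

   ELEMENT_PER_ATOM = {"INT8": 32, "INT16": 16, "FP16": 16}

   COMPONENTS_PER_ELEMENT = {
       "BIAS": 1,
       "PRELU": 1,
       "BATCH_NORM": 2,
       "EW_ALU_OR_MUL": 1,    # element-wise, only ALU or only MUL enabled
       "EW_ALU_AND_MUL": 2,   # element-wise, both ALU and MUL enabled
   }

   BYTES_PER_COMPONENT = {"ONE_BYTE": 1, "TWO_BYTE": 2}

   def bytes_per_atom(proc_precision, data_use, data_size):
       return (ELEMENT_PER_ATOM[proc_precision]
               * COMPONENTS_PER_ELEMENT[data_use]
               * BYTES_PER_COMPONENT[data_size])

   # Example: INT16 batch normalization with 2-byte components.
   assert bytes_per_atom("INT16", "BATCH_NORM", "TWO_BYTE") == 64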
Alignment of Start Address and Stride
-------------------------------------

The table below summarizes the alignment requirements:

.. table:: Requirements of alignment
   :name: tab_requirements_of_alignment

   +---------------+------------+------------+------------+------------+------------+
   | Data format   | Alignment  | Alignment  | Alignment  | Alignment  | Alignment  |
   |               | of start   | of line    | of surface | of planar/ | of data    |
   |               | address    | stride     | stride     | cube       | size       |
   |               |            |            |            | stride     |            |
   +===============+============+============+============+============+============+
   | Feature data  | 32 bytes   | 32 bytes   | 32 bytes   | 32 bytes   | NA         |
   | cube          |            |            |            |            |            |
   +---------------+------------+------------+------------+------------+------------+
   | Uncompressed/ | 256 bytes  | NA         | NA         | NA         | 128 bytes  |
   | compressed    |            |            |            |            |            |
   | weight        |            |            |            |            |            |
   +---------------+------------+------------+------------+------------+------------+
   | WMB           | 256 bytes  | NA         | NA         | NA         | 128 bytes  |
   +---------------+------------+------------+------------+------------+------------+
   | WGS           | 256 bytes  | NA         | NA         | NA         | 128 bytes  |
   +---------------+------------+------------+------------+------------+------------+
   | Pitch-linear  | 32 bytes   | 32 bytes   |            | NA         | NA         |
   | pixel         |            |            |            |            |            |
   +---------------+------------+------------+------------+------------+------------+
   | Bias          | 32 bytes   | 32 bytes   | 32 bytes   | NA         | NA         |
   +---------------+------------+------------+------------+------------+------------+
   | PReLU         | 32 bytes   | NA         | NA         | NA         | NA         |
   +---------------+------------+------------+------------+------------+------------+
   | Batch         | 32 bytes   | NA         | NA         | NA         | NA         |
   | normalization |            |            |            |            |            |
   +---------------+------------+------------+------------+------------+------------+
   | Element-wise  | 32 bytes   | 32 bytes   | NA         | NA         | 32 bytes   |
   +---------------+------------+------------+------------+------------+------------+
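Software can sanity-check buffers against these requirements before programming a layer. The sketch below transcribes the start-address and line-stride columns of the table; the format keys and the function are illustrative only.

.. code-block:: python

   ALIGNMENT = {                    # (start address, line stride), in bytes
       "feature": (32, 32),
       "weight": (256, None),       # uncompressed/compressed weight
       "wmb": (256, None),
       "wgs": (256, None),
       "pixel": (32, 32),           # pitch-linear pixel
       "bias": (32, 32),
       "prelu": (32, None),
       "batch_norm": (32, None),
       "element_wise": (32, 32),
   }

   def check_alignment(fmt, start_address, line_stride=None):
       start_req, line_req = ALIGNMENT[fmt]
       if start_address % start_req:
           raise ValueError(f"{fmt}: start address must align to {start_req} B")
       if line_req and line_stride is not None and line_stride % line_req:
           raise ValueError(f"{fmt}: line stride must align to {line_req} B")

   check_alignment("feature", 0x4000, line_stride=256)   # passes silently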