The NVIDIA® Deep Learning Accelerator (NVDLA) is a configurable fixed-function hardware accelerator targeting inference operations in deep learning applications. It provides full hardware acceleration for a convolutional neural network (CNN) by exposing individual building blocks that accelerate operations associated with each CNN layer (e.g., convolution, deconvolution, fully-connected, activation, pooling, local response normalization, etc.). Maintaining separate and independently configurable blocks means that the NVDLA can be sized appropriately for many smaller applications where inferencing was previously not feasible due to cost, area, or power constraints. This modular architecture enables a highly configurable solution that readily scales to meet specific inferencing needs.
About This Document
This document focuses on the logical organization and control of the NVIDIA Deep Learning Accelerator. It provides information for the blocks and interfaces that control fundamental operations. The description of each block includes a functional overview, configuration options, and register listings for that block. Not all features and functions of every block are present in every NVDLA implementation.
NVDLA operation begins with the management processor (either a microcontroller or the main CPU) sending down the configuration of one hardware layer, along with an “activate” command. If data dependencies do not preclude this, multiple hardware layers can be sent down to different blocks and activated at the same time (i.e., if there exists another layer whose inputs do not depend on the output from the previous layer). Because every block has a double-buffer for its configuration registers, it can also capture a second layer’s configuration to begin processing immediately when the active layer has completed. Once a hardware engine finishes its active task, it will issue an interrupt to the management processor to report the completion, and the management processor will then begin the process again. This command-execute-interrupt flow repeats until inference on the entire network is complete.
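The flow can be sketched in host-side pseudocode as below; the structure and helper names here are illustrative stand-ins, not part of any NVDLA software interface:

```c
#include <stdio.h>

struct layer_config { int block_id; /* per-layer register values ... */ };

/* Stub: a real driver would write this block's register group over CSB
 * and then set the group's "enable" bit. */
static void configure_and_activate(const struct layer_config *cfg)
{
    printf("program block %d and set enable\n", cfg->block_id);
}

/* Stub: a real driver would sleep until the "done" interrupt arrives. */
static void wait_for_done_interrupt(void)
{
    printf("wait for completion interrupt\n");
}

int main(void)
{
    struct layer_config net[] = { {0}, {1}, {2} };
    for (int i = 0; i < 3; i++) {   /* command-execute-interrupt loop */
        configure_and_activate(&net[i]);
        wait_for_done_interrupt();
    }
    return 0;
}
```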
NVDLA has two modes of operation: independent mode and fused mode. In independent mode, each individual block is configured for when and what it executes, with each block working on its assigned task. Independent operation begins and ends with the assigned block performing memory-to-memory operations, in and out of main system memory or dedicated SRAM memory. Fused operation is similar to independent operation; however, some blocks can be assembled into a pipeline. This improves performance by bypassing the round trip through memory, instead having blocks communicate with each other through small FIFOs (i.e., the convolution core can pass data to the Single Data Point Processor, which can pass data to the Planar Data Processor, and in turn to the Cross-channel Data Processor, without performing memory-to-memory operations in between).
Each block in the NVDLA architecture exists to support specific operations integral to inference on deep neural networks. Inference operations are divided into five groups:
Convolution operations (Convolution core and buffer blocks)
Single Data Point operations (Activation engine block)
Planar Data operations (Pooling engine block)
Multi-Plane operations (Local response normalization block)
Data Memory and Reshape operations (Reshape and Bridge DMA blocks)
Different deep learning applications require different inference operations. For example, the workload of real image segmentation is very different from that of face-detection. As a result, performance, area, and power requirements for any given NVDLA design will vary. The NVDLA architecture implements a series of hardware parameters that are used to define feature selection and design sizing. These hardware parameters provide the basis for creating an NVDLA hardware design specification. The design specification identifies the supported features and performance characteristics for an NVDLA implementation.
Note
The descriptions in the following sections contain references to or identify various hardware parameters and settings that might influence performance. Refer to the Hardware Parameters sections of this document for more information.
Convolution operations work on two sets of data: one set of offline-trained “weights” (which remain constant between each run of inference), and one set of input “feature” data (which varies with the network’s input). The NVDLA Convolution Engine exposes parameters that enable several different modes of operation. Each of these modes includes optimizations that improve performance over a naive convolution implementation:
Direct
Image-input
Winograd
Batching
Enabling different modes of operation allows many different sizes of convolutions to be mapped onto the hardware with higher efficiency. Support for sparse weight compression saves memory bandwidth. Built-in Winograd convolution support improves compute efficiency for certain sizes of filters. Batching convolution can save additional memory bandwidth by reusing weights when running multiple inferences in parallel. To avoid repeated accesses to system memory, the NVDLA convolution engine has an internal RAM reserved for weight and input feature storage, referred to as the “convolution buffer”. This design greatly improves memory efficiency over sending a request to the system memory controller each time a weight or feature is needed.
Direct convolution mode is the basic mode of operation. NVDLA incorporates a wide multiply–accumulate (MAC) pipeline to support efficient parallel direct convolution. There are two key factors that impact convolution performance: memory bandwidth and MAC efficiency.
NVDLA supports two memory bandwidth optimization features that can significantly reduce the memory bandwidth requirements of CNN layers that require large data transfers, e.g., fully-connected layers.
Sparse compression. The sparser the feature data and/or weight data, the less traffic on the memory bus. A 60% sparse network (60% of the data is zero) can cut the memory bandwidth requirement almost in half.
Second memory interface. Provides efficient on-chip buffering, which can increase memory bandwidth and reduce memory latency. An on-chip SRAM can usually provide 2x~4x the DRAM bandwidth at 1/10~1/4 the latency.
The second key factor that impacts convolution performance is MAC efficiency. The number of MAC instances is (Atomic-C * Atomic-K). However, if a layer’s input feature channel count is not a multiple of the Atomic-C setting, or its output kernel count is not a multiple of the Atomic-K setting, then in some cycles not all MACs are valid, resulting in a drop in MAC utilization. For example, if the NVDLA design specification has Atomic-C = 16 and Atomic-K = 64 (yielding 1024 MAC instances), and one layer of the network has an input feature channel count of 8 and an output kernel count of 16, then MAC utilization will be only 1/8th (i.e., only 128 MACs will be utilized, with the others idle at all times).
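The utilization penalty follows directly from rounding channels and kernels up to the atomic sizes; a small sketch (the helper below is illustrative, not part of any NVDLA software):

```c
#include <stdio.h>

/* MAC-array utilization for one layer: channels are processed in Atomic-C
 * groups and kernels in Atomic-K groups, so partially filled groups leave
 * MACs idle. */
static double mac_utilization(int channels, int kernels,
                              int atomic_c, int atomic_k)
{
    int c_groups = (channels + atomic_c - 1) / atomic_c;  /* ceil division */
    int k_groups = (kernels  + atomic_k - 1) / atomic_k;
    double used  = (double)channels * kernels;
    double total = (double)c_groups * atomic_c * k_groups * atomic_k;
    return used / total;
}

int main(void)
{
    /* The example from the text: Atomic-C=16, Atomic-K=64, C=8, K=16. */
    printf("%.3f\n", mac_utilization(8, 16, 16, 64));  /* prints 0.125 */
    return 0;
}
```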
Hardware Parameters:
Atomic – C sizing
Atomic – K sizing
Data type supporting
Feature supporting – Compression
Feature supporting – Second Memory Bus
Image-input mode is a special direct convolution mode for the first layer, which takes its input feature data from an image surface. Because the image surface format is quite different from the normal feature data format, feature data fetching follows a different path from direct convolution operations. The first layer normally has only 3 channels for image input, so additional logic is added here to enhance MAC utilization: even though the first layer has 3 channels (or even a single channel), a channel extension feature keeps average MAC utilization close to 50%, even when the Atomic-C setting is large (e.g., 16).
Hardware Parameters:
All from Direct Convolution mode +
Image input support
Winograd convolution refers to an optional algorithm used to optimize the performance of direct convolution. Winograd convolution reduces the number of multiplications, while adding extra adders to handle the accompanying transforms. Because the number of additions for the pre- and post-calculations is much smaller than the number of operations in the MAC array, the overall number of operations is reduced, and a large number of MAC operations are avoided for the same convolution. For example, a 3x3 filter-sized convolution with Winograd reduces the number of MAC operations by a factor of 2.25x, improving both performance and power efficiency. Weight conversion is done offline, so the total weight data size is expected to increase. The Winograd feature is very useful for math-limited layers; layers with 3x3 or larger filters are always math-limited, so they match the Winograd feature well.
The equation of Winograd convolution used in the convolution core takes the standard minimal-filtering form:

\(S = A^T \left[ (G g G^T) \odot (B^T d B) \right] A\)

where \(g\) is the filter tile, \(d\) is the input data tile, and \(A\), \(B\), and \(G\) are fixed transform matrices. Here the symbol \(\odot\) indicates element-wise multiplication. This means the Winograd function requires a pre-calculation of a fixed matrix operation before the normal direct-convolution MAC array, and a post-calculation of another fixed matrix operation after it.
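As a worked check of the 2.25x figure quoted above, assuming the commonly used F(2x2, 3x3) tiling (a 2x2 output tile computed from a 4x4 transformed tile):

```latex
\[
  \underbrace{2 \cdot 2 \cdot 3 \cdot 3}_{\text{direct MACs per tile}} = 36,
  \qquad
  \underbrace{4 \cdot 4}_{\text{Winograd multiplies per tile}} = 16,
  \qquad
  \frac{36}{16} = 2.25
\]
```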
Hardware Parameters:
Feature supporting – Winograd
The NVDLA batching feature supports processing multiple sets of input activations (from multiple images) at a time. This reuses weights and saves significant memory bandwidth, improving performance and power. For fully-connected layers, the memory bandwidth requirement far exceeds the computation requirement: the weight data is large, and each weight is used only a single time in the MAC functions (this is one of the leading causes of memory bandwidth bottlenecks). Allowing multiple sets of activations to share the same weight data means they can run at the same time (reducing overall run-time). The run-time for a batch is close to that for a single layer; overall performance is close to [batching_size] x [single-layer performance].
Note
Support for large batching sizes means a large area cost for activation buffering. Maximum batching size is limited by the convolution buffer size, so the maximum batching number is a hardware limitation in the design specification.
Hardware Parameters:
Feature batch support
Max batch number
The Convolution buffer is one pipeline stage of the Convolution core. It contains both the weight data and the feature data for the convolution function. The ratio of weight data size to feature data size varies between networks and even within a single network (different layers can have completely different ratios between feature and weight data). To accommodate these differences, the convolution buffer enables a configurable storage strategy for both weight data and feature data.
The Convolution buffer requires at least 4 ports for data access:
Read port for feature data
Read port for weight data
Write port for feature data
Write port for weight data
Note
If compression features are supported, ports for compression tags are also required. There are different ways these ports can be shared (e.g., banking with dedicated configuration); refer to the reference design and documentation for more information.
The Convolution buffer size depends on various factors; the primary factor is CNN size (i.e., feature data size and weight data size). It is preferable if the full weight data or feature data of one hardware layer can be stored in the Convolution buffer (this removes the need to fetch data multiple times). Convolution read bandwidth determines the width of the read port. In order to feed the required amount of Atomic-C data in a single cycle, a data width of (data_size * Atomic-C) is required. For example, for an Atomic-C of 16 on an INT8 convolution function, a 128-bit (16-byte) width is required.
Hardware Parameters:
BUFF bank #
BUFF bank size
The Single Data Point Processor (SDP) allows for the application of both linear and non-linear functions onto individual data points. This is commonly used immediately after convolution in CNN systems. The SDP provides native support for linear functions (e.g., simple bias and scaling) and uses lookup tables (LUTs) to implement non-linear functions. This combination supports most common activation functions as well as other element-wise operations, including ReLU, PReLU, precision scaling, batch normalization, and bias addition, along with more complex non-linear functions such as sigmoid or hyperbolic tangent.
Hardware Parameters:
SDP function support
SDP throughput
NVDLA supports multiple instances of linear functions (which are mostly scaling functions). There are several methods for setting the scaling factor and bias: 1) CNN setting - the scaling factor and bias are the same throughout the whole CNN; this scaling factor comes from a register configuration; 2) Channel setting - the scaling factor and bias are the same within a single plane (i.e., the same channel value); these scaling factors come from the memory interface; 3) Per-pixel setting - the scaling factor and bias are different for every single feature; the factors and bias come from the memory interface.
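One way a driver might represent these three sources is sketched below; the enum and struct names are illustrative, not NVDLA register fields:

```c
/* Illustrative encoding of the three bias/scale sources described above. */
enum sdp_scale_source {
    SCALE_PER_LAYER,    /* one factor for the whole CNN, from a register */
    SCALE_PER_CHANNEL,  /* one factor per channel, fetched from memory   */
    SCALE_PER_PIXEL     /* one factor per element, fetched from memory   */
};

struct sdp_linear_cfg {
    enum sdp_scale_source source;
    int scale;                    /* used directly for SCALE_PER_LAYER   */
    int bias;
    unsigned long long mem_addr;  /* surface address for the other modes */
};
```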
Precision Scaling. Controls memory bandwidth throughout the full inference process; feature data can be scaled to its full range before being truncated to lower precision and written to memory. Key resources (e.g., the MAC array) can operate on the full range for the best inference result (other linear functions may be applied). Input data is reverted before any of the non-linear functions (i.e., the input data of non-linear functions is kept as the original data).
Batch Normalization. During inference, batch normalization requires a linear function with a trained scaling factor. SDP can support a per-layer parameter or a per-channel parameter for the batch normalization operation.
Bias Addition. Some layers require a bias function at the output side, meaning that they need to apply an offset (from a per-layer setting, a per-channel memory surface, or a per-feature memory surface) to the final result.
Element-Wise Operation. The element-wise layer (used in some CNNs) refers to a type of operation between two feature data cubes which have the same W, H, and C sizes. These two W x H x C feature data cubes undergo element-wise addition, multiplication, or max/min comparison, outputting one W x H x C feature data cube. NVDLA supports the common element-wise operations (e.g., add, subtract, multiply, max).
There are several non-linear functions that are required to support deep learning algorithms. Some of these are supported using dedicated hardware logic, while more complex functions use a dedicated look-up table.
ReLU, for an input \(x\), the output is \(\textrm{max}(x, 0)\).
PReLU, different from ReLU, keeps a small linear factor instead of clamping to zero:
\(y = \begin{cases} x & x > 0 \\ k * x & x \leq 0 \end{cases}\)
Sigmoid, for an input \(x\), the output is \(\frac{1}{1+e^{-x}}\)
Hyperbolic tangent, for an input \(x\), the output is \(\frac{1-e^{-2x}}{1+e^{-2x}}\)
And more…
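For reference, the functions above in plain C (float versions for clarity; the hardware implements them with dedicated logic or LUTs, as described):

```c
#include <math.h>

static float relu(float x)           { return x > 0.0f ? x : 0.0f; }
static float prelu(float x, float k) { return x > 0.0f ? x : k * x; }
static float sigmoid(float x)        { return 1.0f / (1.0f + expf(-x)); }

/* tanh written in the document's form, (1 - e^(-2x)) / (1 + e^(-2x)) */
static float tanh_act(float x)
{
    float e = expf(-2.0f * x);
    return (1.0f - e) / (1.0f + e);
}
```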
The Planar Data Processor (PDP) supports specific spatial operations that are common in CNN applications. It is configurable at runtime to support different pool group sizes, and supports three pooling functions:
maximum-pooling – get maximum value from pooling window.
minimum-pooling – get minimum value from pooling window.
average-pooling – average the feature value in the pooling window.
The PDP unit has a dedicated memory interface to fetch input data from memory and outputs directly to memory.
Hardware Parameters:
PDP throughput
The Cross-channel Data Processor (CDP) is a specialized unit built to apply the local response normalization (LRN) function - a special normalization function that operates on channel dimensions, as opposed to the spatial dimensions.
Hardware Parameters:
CDP throughput
The bridge DMA (BDMA) module provides a data copy engine to move data between the system DRAM and a dedicated high-performance memory interface, where present. It provides an accelerated path to move data between these two otherwise unconnected memory systems.
Hardware Parameters:
BDMA function support
The data reshape engine performs data format transformations (e.g., splitting or slicing, merging, contraction, reshape-transpose). Data in memory often needs to be reconfigured or reshaped in the process of performing inferencing on a convolutional network. For example: “slice” operations may be used to separate out different features or spatial regions of an image; “reshape-transpose” operations (common in deconvolutional networks) create output data with larger dimensions than the input dataset.
The Rubik function transforms data mapping format without any data calculations. It supports three working modes:
Contract Mode. Contract mode in Rubik transforms the mapping format to de-extend the data cube. It acts as a second hardware layer to support deconvolution. Normally, a software deconvolution layer has deconvolution x and y strides greater than 1; with these strides, the output of the phase-I hardware layer is a channel-extended data cube.
Split Mode and Merge Mode. Split and merge are two opposite operation modes in Rubik. Split transforms a data cube into M-planar format (NCHW); the number of planes is equal to the channel size. Merge transforms a series of planes into a feature data cube.
Hardware Parameters:
Rubik function support
Different deep learning applications require different inference operations. For example, the workload of real image segmentation is very different from that of face-detection. As a result, performance, area, and power requirements for any given NVDLA design will vary. NVDLA addresses this with a set of configurable hardware parameters that are used to create an implementation that fits the application needs.
Hardware parameters provide the basis for creating an NVDLA hardware design specification. The design specification identifies the supported features and performance characteristics for an NVDLA implementation. There are two categories of hardware parameters: Feature Selection and Design Sizing. A given NVDLA implementation is defined by the parameters and settings selected.
Feature parameters identify which individual features an NVDLA implementation will support. Configurable options include:
NVDLA can support a single data type for a specific network, or multiple data types for more general-purpose use. The NVDLA hardware architecture can share resources between the different data types to optimize both area cost and power efficiency.
Parameter: Data type supporting
Values: Binary/INT4/INT8/INT16/INT32/FP16/FP32/FP64
Affected operations: All
Winograd is an optimization feature for the convolutional function. It can improve performance by increasing MAC efficiency, and it can also help with the overall power efficiency. See Winograd Convolution Mode for more information.
Parameter: Feature supporting – Winograd
Possible values: Yes/No
Affected operations: Convolution
Batching is an optimization feature for convolution. It improves performance by both increasing MAC efficiency and saving memory traffic. See Batching Convolution Mode for more information.
Parameter: Feature supporting – batch
Possible values: Yes/No
Affected operations: Convolution
Sparse compression is an optimization feature for convolution. It can reduce the total amount of memory traffic and thus improve performance and save power. See Direct Convolution Mode for more information.
Parameter: Feature supporting – Sparse Compression
Possible values: Weight/Feature/Neither/Both
Affected operations: Convolution
NVDLA always has a basic interface to external memory over its DBBIF. Besides that, NVDLA can also support a second memory bus interface named SRAMIF. This interface can connect to on-chip SRAM or other high-bandwidth, low-latency buses to improve overall performance.
Parameter: Feature supporting – Second Memory Bus
Possible values: Yes/No
Affected operations: All
Planar image data is an important input resource for deep learning, and there are a large number of image surface formats. The supported input formats can therefore be very important for the first hardware layer of NVDLA.
Parameter: Image input support
Possible values: combinations of 8-bit/16-bit/both; RGB/YUV/both; non-planar/semi-planar/full-planar
Affected operations: Convolution
There are many nonlinear curves used as activation functions for Deep Learning, including (P)ReLU, tanh, sigmoid, and more. Some of these, such as ReLU, are very simple and can be implemented trivially with scaling beyond a threshold; others require extra memory to approximate using a lookup table.
Parameter: SDP function support
Possible values: Scaling/LUT
Affected operations: Single Data Point
If NVDLA supports a second memory interface, the bridge DMA (BDMA) unit can copy data between the main memory interface and the second memory interface.
Parameter: BDMA function support
Possible values: Yes/No
Affected operations: MISC
The Rubik function transforms data mapping format without any data calculations. See Data Reshape Engine for more information.
Parameter: Rubik function support
Possible values: Yes/No
Affected operations: MISC
Design sizing parameters indicate the parallelism that is supported in the NVDLA Hardware. A larger value usually means higher performance but with an associated higher area and/or power cost.
This value indicates the parallelism of MAC operations in the input feature channel dimension. This parameter impacts the total MAC count and the convolution buffer read bandwidth.
Parameter: Atomic – C sizing
Values: 16~128
Affected operations: Convolution
This value indicates the parallelism of MAC operations in the output feature channel dimension. This parameter impacts the total MAC count, the number of accumulator instances, and the convolution write-back bandwidth.
Because the MAC array has two dimensions, the total MAC count is (Atomic-C * Atomic-K).
Parameter: Atomic – K sizing
Range of values: 4~16
Affected scope: Convolutional function
Note
Both the Atomic-C and Atomic-K parameters refer to the lowest supported precision; higher precisions may have correspondingly reduced parallelism. For example, if NVDLA supports both INT8 and INT16, then the Atomic-C and Atomic-K parameters refer to the INT8 case, and INT16 is expected to have lower parallelism.
The Single Data Point (SDP) throughput indicates the number of instances of SDP pipelines. The number of SDP pipelines determines the number of output features that can be generated each cycle.
Parameter: SDP throughput
Range of values: 1~16
Affected scope: SDP
The Planar Data Processor (PDP) throughput indicates the number of output features that can be generated each cycle. A value of 0 indicates that no PDP blocks will be included and the planar data processor operation will not be supported.
Parameter: PDP throughput
Range of values: 0~4
Affected operations: PDP
The CDP throughput indicates the number of output features that can be generated every cycle. A value of 0 indicates that no CDP block will be included and multi-plane operations will not be supported by the resulting implementation.
Parameter: CDP throughput
Range of values: 0~4
Affected operations: CDP
This value indicates the number of convolution buffer banks. The number of banks defines the granularity of CBUF storage allocation between weights and activations. Together with the bank size parameter, it determines the overall size of the convolution buffer.
Parameter: BUFF bank #
Range of values: 2~32
Affected operations: Convolution
This value indicates the size of a single convolution buffer bank. Together with the bank number parameter, it determines the overall size of the convolution buffer.
Parameter: BUFF bank size
Range of values: 4KB~32KB
Affected operations: Convolution
This value indicates the maximum batching number that the convolution function can support. A larger value for this parameter usually has an area impact, as more buffering is required on the accumulator side.
Parameter: MAX Batch number
Range of values: 1~32
Affected operations: Convolution
NVDLA supports inference with multiple data types, selected according to the workload. These parameters can be used to improve network accuracy under given power and performance constraints. Floating-point data types provide high precision (FP64/FP32/FP16); integer data types (INT16/INT8/INT4), or even single-bit binary, can be used for lower-precision applications.
The precision-scaling convertors are normally used before critical, resource-limited points, such as before writing data to memory or before entering the MAC array.
The formula for convertor is: \(y = \textrm{saturation} ((x - \textrm{offset}) * \textrm{scaling} >> \textrm{shifter})\).
A shifter is mostly used for bit-width adjustment in the middle of the pipeline. For example, the accumulator bit width is far larger than that of the input data, so before data is sent out to the SDP, it needs to be truncated by a shifter.
The shifter is a simplified convertor, with the formula: \(y = \textrm{saturate}(x << \textrm{shifter})\).
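Both formulas are easy to express in code; a minimal sketch for an INT8 output, with bit widths and rounding chosen for illustration:

```c
#include <stdint.h>

/* Clamp an intermediate result into the INT8 range. */
static int8_t saturate_i8(int64_t v)
{
    if (v >  127) return  127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* Full convertor: y = saturate(((x - offset) * scaling) >> shifter) */
static int8_t convert(int32_t x, int32_t offset, int32_t scaling, int shifter)
{
    int64_t y = ((int64_t)(x - offset) * scaling) >> shifter;
    return saturate_i8(y);
}

/* Simplified convertor (shifter only): y = saturate(x << shifter) */
static int8_t shift_only(int32_t x, int shifter)
{
    return saturate_i8((int64_t)x << shifter);
}
```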
LUTs are used to handle non-linear functions in a network, such as the sigmoid and tanh activation functions, or local response normalization.
Small-sized NVDLA implementations target smaller workloads; as such, these implementations only need very basic support. Because of the light workload, 64 INT8 MACs with Atomic-C = 16 and Atomic-K = 4 should be good enough. All other optimization features can be removed to save area. For the image input format, supporting a basic format like A8R8G8B8 is likely good enough. If pooling and normalization functions are required, throughput can be limited. As for the convolution buffer, 4 banks of 8KB each (32KB in total) can be supported.
Example hardware parameter settings:
Data type supporting = INT8
Feature supporting - Winograd = No
Feature supporting - Second Memory Bus = No
Feature supporting - Compression = No
Image input support = R8, A8B8G8R8, A8R8G8B8, B8G8R8A8, R8G8B8A8, X8B8G8R8, X8R8G8B8, B8G8R8X8, R8G8B8X8, Y8___U8V8, Y8___V8U8
SDP function support = Single Scaling
BDMA function support = No
Rubik function support = No
Atomic - C sizing = 8
Atomic - K sizing = 8
SDP throughput = 1
PDP throughput = 1
CDP throughput = 1
BUFF bank # = 32
BUFF bank size = 4KB
Larger NVDLA implementations target heavier workloads. This model is a better choice when the primary emphasis is on higher performance and versatility. Increasing Atomic-C and Atomic-K to 64/16 raises NVDLA performance to a maximum of 2K operations per cycle; enabling all the other optimizations increases effective operations further. Other post-processing throughput also needs to increase (e.g., PDP and CDP throughput change to 4). When targeting a larger CNN, configure a larger convolution buffer (e.g., 32KB * 16 = 512KB).
Example hardware parameter settings:
Data type supporting = FP16/INT16
Feature supporting - Winograd = Yes
Feature supporting - Second Memory Bus = Yes
Feature supporting - Compression = Yes
Image input support = A8R8G8B8/YUV16 Semi-planar
SDP function support = Scaling/LUT
BDMA function support = Yes
Rubik function support = No
Atomic - C sizing = 64
Atomic - K sizing = 16
SDP throughput = 16
PDP throughput = 4
CDP throughput = 4
BUFF bank # = 16
BUFF bank size = 32KB
The NVDLA has four interfaces to the system as a whole. These are:
Configuration space bus (“CSB”). The host system accesses and configures the NVDLA register set with a very simple address/data interface. Some systems may directly connect the host CPU to the CSB interface, with a suitable bus bridge; other, potentially larger, systems will instead connect a small microcontroller to the CSB interface, offloading some of the work of managing the NVDLA to the external core.
External interrupt (“IRQ”): Certain states in the NVDLA demand asynchronous reporting to the processor that is commanding the NVDLA; these states include operation completion and error conditions. The external interrupt interface provides a single output pin that complements the CSB interface.
Data backbone (“DBBIF”): The NVDLA contains its own DMA engine to load and store values (including parameters and datasets) from the rest of the system. The data backbone is an AMBA AXI4-compliant interface that is intended to access large quantities of relatively high-latency memory (such as system DRAM).
SRAM connection (“SRAMIF”): Some systems may have the need for more throughput and lower latency than the system DRAM can provide, and may wish to use a small SRAM as a cache to improve the NVDLA’s performance. A secondary AXI4-compliant interface is provided for an optional SRAM to be attached to the NVDLA.
Below, we present two examples of platforms that integrate an NVDLA, and how they attach these external connections to the rest of the system.
Fig. 6 shows a small system, in which NVDLA is directly connected to the main CPU. The small system has no NVDLA-dedicated SRAM, and all accesses hit the main system memory. By comparison, Fig. 7 shows a somewhat larger system, in which the NVDLA connects to a microcontroller, which is responsible for managing the small details of programming the NVDLA (and, as such, freeing the main CPU from servicing low-level NVDLA interrupts). The latter system also integrates an SRAM, attached to NVDLA. (Other units on the system may also have connections to this SRAM, and share it for their own needs; this is not shown in the diagram.)
The CPU uses the CSB (Configuration Space Bus) interface to access NVDLA registers. The CSB interface is intentionally extremely simple, and low-performance; as such, it should be simple to build an adapter between the CSB and any other system bus that may be supported on a platform. The CSB bus consists of three channels: the request channel, the read data channel, and the write response channel. These channels are as described below.
The CSB interface uses a single clock domain, shared between NVDLA and the requester.
The request channel follows a valid/ready protocol; a data transaction occurs on the request channel when and only when the valid signal (from the host) and the ready signal (from NVDLA) are both asserted in the same clock cycle. Each request to CSB has a fixed request size of 32 bits of data, and has a fixed 16-bit address size. CSB does not support any form of burst requests; each packet sent down the request channel is independent from any other packet.
| Data field | # Bits | Direction | Description |
|---|---|---|---|
| csb2nvdla_valid | 1 | Input | Indicates that a request is valid |
| csb2nvdla_ready | 1 | Output | Indicates that the receiver is ready to take a request |
| csb2nvdla_addr | 16 | Input | Address. Aligned to word boundary. |
| csb2nvdla_wdat | 32 | Input | Write data |
| csb2nvdla_write | 1 | Input | Write flag (0: request is a read; 1: request is a write) |
| csb2nvdla_nposted | 1 | Input | Non-posted write transaction indicator. Posted write transactions are writes where the requester does not expect to and will not receive a write completion from the receiver on the write ack channel; the requester will not know if the write encounters an error. Non-posted write transactions are writes where the requester expects to receive a write completion or write error on the write ack channel from the receiver. |
The read data response channel is described in the below table. NVDLA returns read-response data to the host in strict request order; that is to say, each request packet (above) for which “write” is set to 0 will have exactly one response, and that response cannot jump forward or backwards relative to other reads.
The read data channel follows a valid-only protocol; as such, the host cannot apply back-pressure to the NVDLA on this interface.
Note
NVDLA does not support error reporting from the CSB. Illegal reads (e.g. reads directed at an address hole) will return zeroes.
| Data field | # Bits | Direction | Description |
|---|---|---|---|
| nvdla2csb_valid | 1 | Output | Indicates that read data is valid. |
| nvdla2csb_data | 32 | Output | Data corresponding to a read request, or zero in the event of an error. |
The signals associated with the write response channel are described in the below table. NVDLA will return write completion to the host in request order for every non-posted write.
The write completion channel also follows a valid-only protocol, and as such, the host cannot back-pressure NVDLA on this interface.
| Data field | # Bits | Direction | Description |
|---|---|---|---|
| nvdla2csb_wr_complete | 1 | Output | Indicates that a CSB write has completed. |
Along with the configuration space bus, NVDLA provides an asynchronous (interrupt-driven) return channel to deliver event notifications to the CPU. The interrupt signal is a level-driven interrupt that is asserted high as long as the NVDLA core has interrupts pending. Interrupts are pending if any bits are set in GLB’s INTR_STATUS register that are also not masked out (i.e., set to zero) in the INTR_MASK register. The NVDLA interrupt signal is on the same clock domain as the CSB interface.
| Data field | # Bits | Direction | Description |
|---|---|---|---|
| nvdla2core_interrupt | 1 | Output | Active high while an interrupt is pending from NVDLA. |
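Expressed as code, the pending condition is a simple mask test; a minimal sketch assuming the two 32-bit GLB registers described above:

```c
#include <stdint.h>

/* The interrupt line stays asserted while any unmasked status bit is set;
 * a mask bit of zero means "not masked out". */
static int nvdla_irq_pending(uint32_t intr_status, uint32_t intr_mask)
{
    return (intr_status & ~intr_mask) != 0;
}
```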
NVDLA has two major interfaces to interact with the memory system; these are called the DBBIF (referred to as core2dbb in signal naming) and the SRAMIF (referred to as core2sram in signal naming). The DBBIF interface is intended to be connected to an on-chip network which connects to the system memory, while the SRAMIF is intended to be connected to an optional on-chip SRAM with lower memory latency and potentially higher throughput. Both the DBBIF and SRAMIF interfaces are AXI4 compliant. This section describes the DBBIF interface in detail.
The NVDLA data backbone interface supports a configurable data bus width of 32, 64, 128, 256 or 512-bits. To tolerate memory latency, internal buffers can be configured to support a configurable number of outstanding requests up to 256.
The data backbone interface follows an AXI-like protocol, but makes assumptions to simplify the interface protocol:
Always issues incremental burst requests
Burst size always aligns with the data width
Request addresses are always aligned to the data width
Writes must always be acknowledged, reads must always get return data
Writes must be committed to memory when NVDLA gets a write acknowledge
Reads must always get the actual value from memory
Write acknowledge must be returned in write request order
Read data must be returned in read request order
The NVDLA DBBIF assumes a synchronous data backbone interface with a single clock domain and reset. Therefore, all NVDLA DBBIF ports are part of the main NVDLA core clock domain; synchronization to the SoC data backbone must be done outside the NVDLA core.
The table below lists all signals on the AW channel, with an implied prefix of nvdla_core2dbb_aw_.
| Data field | # Bits | Direction | Description |
|---|---|---|---|
| awvalid | 1 | Output | Write request valid |
| awready | 1 | Input | Write request ready |
| awlen | 4 | Output | Burst length |
| awaddr | Config | Output | Write address; can be configured to be 32 or 64 bits |
| awid | 8 | Output | Write request ID tag |
The table below lists all signals on the AR channel, with an implied prefix of nvdla_core2dbb_ar_.
| Data field | # Bits | Direction | Description |
|---|---|---|---|
| arvalid | 1 | Output | Read request valid |
| arready | 1 | Input | Read request ready |
| arlen | 4 | Output | Burst length |
| araddr | Config | Output | Read address; can be configured to be 32 or 64 bits |
| arid | 8 | Output | Read request ID tag |
The table below lists all signals on the W channel, with an implied prefix of nvdla_core2dbb_w_.
| Data field | # Bits | Direction | Description |
|---|---|---|---|
| wvalid | 1 | Output | Write data valid |
| wready | 1 | Input | Write data ready |
| wdata | Config | Output | Write data. Width configurable to 32/64/128/256/512 bits |
| wlast | 1 | Output | Last write indicator |
| wstrb | Config | Output | Write strobes, specifying the byte lanes of the data bus that contain valid information. Each bit in wstrb represents 8 bits of the data bus; the width of wstrb is (data width)/8. |
The table below lists all signals on the B channel, with an implied prefix of nvdla_core2dbb_b_.
| Data field | # Bits | Direction | Description |
|---|---|---|---|
| bvalid | 1 | Input | Write response valid |
| bready | 1 | Output | Write response ready |
| bid | 8 | Input | Write response ID |
The table below lists all signals on the R channel, with an implied prefix of nvdla_core2dbb_r_.
| Data field | # Bits | Direction | Description |
|---|---|---|---|
| rvalid | 1 | Input | Read data valid |
| rready | 1 | Output | Read data ready |
| rlast | 1 | Input | Last read data indicator. |
| rdata | Config | Input | Read data with configurable width of 32/64/128/256/512 bits |
| rid | 8 | Input | Read response ID |
The optional NVDLA SRAM interface is used when there is an on-chip SRAM, for the benefit of lower latency and higher throughput. The SRAM interface protocol is exactly the same as the DBBIF interface, but the signals have been renamed with the prefixes nvdla_core2sram_{aw,ar,w,b,r}_ for the aw, ar, w, b, and r channels respectively.
The return order between write acknowledges from SRAMIF and DBBIF is not restricted.
For example, consider two BDMA layers, layer0 and layer1, where layer0 writes to DBBIF and layer1 writes to SRAMIF.
Layer1 may receive its write acknowledge from SRAMIF before layer0 receives its write acknowledge from DBBIF.
This section describes the register address space and register definitions. For each sub-unit, there are status registers, configuration registers, command registers and profiling registers.
One traditional procedure to program hardware is as follows: first, the CPU configures registers on an engine, then it sets the engine’s “enable” bit, then it waits for the hardware to produce a “done” interrupt, and finally it starts the process over again with a new set of registers. This style of programming model will result in the hardware becoming idle between two consecutive hardware layers, which reduces system efficiency.
In order to hide the CPU’s reprogramming latency, NVDLA introduces the concept of ping-pong register programming for per-hardware-layer registers. For most NVDLA subunits, there are two groups of registers; when the subunit is executing using the configuration from the first register set, the CPU can program the second group in the background, setting the second group’s “enable” bit when it is done. When the hardware has finished processing the layer described by the first register set, it will clear the “enable” bit of the first register set, and then switch to the second group if the second group’s “enable” bit has already been set. (If the second group’s “enable” bit has not yet been set, then the hardware becomes idle until programming is complete.) The process, then, repeats, with the second group becoming the active group, and the first group becoming the “shadow” group to which the CPU writes in the background. This mechanism allows the hardware to switch smoothly between active layers, wasting no cycles for CPU configuration.
Note
Unlike a “shadow register” programming model, values written to the inactive group in the “ping-pong” programming model do not get copied to a primary group on activation. As such, the CPU should make sure that all registers in a group have been programmed before enabling the hardware layer to run.
The NVDLA core is built as a series of pipeline stages; each stage is used to handle hardware layers in whole or in part. These pipeline stages are:
CDMA (convolution DMA)
CBUF (convolution buffer)
CSC (convolution sequence controller)
CMAC (convolution MAC array)
CACC (convolution accumulator)
SDP (single data processor)
SDP_RDMA (single data processor, read DMA)
PDP (planar data processor)
PDP_RDMA (planar data processor, read DMA)
CDP (channel data processor)
CDP_RDMA (channel data processor, read DMA)
BDMA (bridge DMA)
RUBIK (reshape engine)
The first five pipeline stages are part of the convolution core pipeline; all of these pipeline stages (except for CBUF and CMAC) use linked ping-pong buffers in order to work together to form HW layers.
Each pipeline stage has the ping-pong mechanism built into its register file, as shown in Fig. 9. In detail, each register file implementation has three register groups; the two ping-pong groups (duplicated register group 0, and group 1) share the same addresses, and the third register group is a dedicated non-shadowed group (shown above as the “single register group”). The PRODUCER register field in the POINTER register is used to select which of the ping-pong groups is to be accessed from the CSB interface; the CONSUMER register field indicates which register group the datapath is sourcing from. By default, both pointers select group 0. Registers are named according to which register set they belong to; a register is in a duplicated register group if its name starts with D_, and otherwise, it is in the single register group.
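As an illustration, a driver might decode the POINTER register as below; the bit positions are assumptions for the sketch, not taken from this document:

```c
#include <stdint.h>

/* Hypothetical decode of a subunit's POINTER register; the PRODUCER and
 * CONSUMER bit positions here are illustrative assumptions. */
#define POINTER_PRODUCER(v)   ((v) & 0x1)          /* CPU-selected group    */
#define POINTER_CONSUMER(v)   (((v) >> 16) & 0x1)  /* datapath's group (RO) */

static int groups_in_sync(uint32_t pointer)
{
    return POINTER_PRODUCER(pointer) == POINTER_CONSUMER(pointer);
}
```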
The registers in the ping-pong groups are parameters used to configure hardware layers. Each group has an enable register field, which is set by software and cleared by hardware. The CPU configures all other fields in the group before the enable bit; when the enable bit is set, the hardware layer is ready to execute. While the enable bit is set, any writes to that register group will be dropped silently until the hardware layer completes execution, at which point the enable bit is cleared by hardware.
Note
If the enable field is set, the hardware layer may either be running or pending. Even if the hardware layer is not actively running (i.e., it is waiting to run), the CPU cannot clear the enable field; any write access to a register group for which the enable field is set will be silently dropped.
Most registers in the single-register groups are read-only status registers. The CONSUMER and PRODUCER pointers, described above, reside in the single group; the CONSUMER pointer is a read-only register field that the CPU can check to determine which ping-pong group the datapath has selected, and the PRODUCER pointer is fully controlled by the CPU, and should be set to the correct group before programming a hardware layer.
The following is an example sequence for how to program an NVDLA subunit. Each NVDLA subunit has the same ping-pong register design; in this sequence, we choose the CDMA submodule as the example unit that we will program.
1. After reset, both group 0 and group 1 are in an idle state. The CPU should read the CDMA_POINTER register, and set PRODUCER to the value of CONSUMER. (After reset, CONSUMER is expected to be 0.)
2. The CPU programs the parameters for the first hardware layer into register group 0. After configuration completes, the CPU sets the enable bit in the D_OP_ENABLE register.
3. Hardware begins processing the first hardware layer.
4. The CPU reads the S_STATUS register to ensure that register group 1 is idle.
5. The CPU sets PRODUCER to 1 and begins programming the parameters for the second hardware layer into group 1. After those registers are programmed, it sets the enable bit in group 1’s D_OP_ENABLE.
6. The CPU checks the status of register group 0 by reading S_STATUS; if it is still executing, the CPU waits for an interrupt.
7. Hardware finishes the processing of the current hardware layer. Upon doing so, it sets the status of the previously active group to idle in the S_STATUS register, and clears the enable bit of the D_OP_ENABLE register.
8. Hardware advances the CONSUMER field to the next register group (in this case, group 1). After advancing the CONSUMER field, it determines whether the enable bit is set on the new group. If so, it begins processing the next hardware layer immediately; if not, hardware waits until the enable bit is set.
9. Hardware asserts the “done” interrupt for the previous hardware layer. If the CPU was blocked waiting for a “done” interrupt, it now proceeds programming, as above.
10. Repeat, as needed.
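A condensed host-side sketch of this sequence is shown below; the register offsets and helpers are illustrative placeholders (a real driver would use the CDMA addresses from the register tables), and the fake register file merely lets the flow be traced on a host machine:

```c
#include <stdint.h>
#include <stdio.h>

/* Fake register file standing in for CSB accesses. */
static uint32_t regs[0x1000];
static uint32_t reg_read(uint32_t a)              { return regs[a >> 2]; }
static void     reg_write(uint32_t a, uint32_t v) { regs[a >> 2] = v; }

#define CDMA_POINTER   0x004  /* hypothetical offsets within the CDMA space */
#define CDMA_S_STATUS  0x008
#define CDMA_OP_ENABLE 0x00c

static void program_layer(int group)
{
    /* Select the producer group, write all D_ registers for the layer,
     * then set the enable bit last. */
    reg_write(CDMA_POINTER, (uint32_t)group);  /* PRODUCER = group */
    /* ... program every duplicated-group register here ... */
    reg_write(CDMA_OP_ENABLE, 1);
    printf("group %d enabled\n", group);
}

int main(void)
{
    int producer = (int)(reg_read(CDMA_POINTER) & 1); /* match CONSUMER (0) */
    program_layer(producer);                          /* layer 0 -> group 0 */
    producer ^= 1;
    /* While group 0 runs, group 1 is programmed in the background. */
    program_layer(producer);                          /* layer 1 -> group 1 */
    /* ... wait for "done" interrupts, check S_STATUS, repeat ... */
    return 0;
}
```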
Note
The NVDLA hardware does not have intrinsic support for dependency tracking; that is to say, hardware layers that are running or pending do not have any mechanism of blocking each other, if one depends on the output of the other. As such, the CPU is responsible for ensuring that if a layer depends on the output of a previous layer, the consuming layer is not scheduled until the producing layer has finished executing.
Warning
This address space layout is not final, and should be expected to change in revisions of the NVDLA design leading up to version 1.0.
The NVDLA requires 256 KiB of MMIO address space for its registers. Although the base address will vary from system to system, all registers on the CSB interface start at a base address of 0x0000_0000. Each subunit inside of NVDLA is assigned 4 KiB of address space. (The CBUF subunit does not have any registers.) The address mapping inside of NVDLA’s address space is as shown in Table 10.
Some hardware configurations may not have certain subunits enabled; for instance, smaller implementations of NVDLA may disable SDP, PDP, or CDP. In such a case, the address space of those subunits is reserved, and their registers are not accessible.
Note
Capability registers will be added to determine the configuration of each NVDLA implementation.
| DLA sub-unit | Start Address | End Address | Size (KiB) |
|---|---|---|---|
| GLB | 0x0000 | 0x0FFF | 4 |
| Reserved | 0x1000 | 0x1FFF | 4 |
| MCIF | 0x2000 | 0x2FFF | 4 |
| SRAMIF | 0x3000 | 0x3FFF | 4 |
| BDMA | 0x4000 | 0x4FFF | 4 |
| CDMA | 0x5000 | 0x5FFF | 4 |
| CSC | 0x6000 | 0x6FFF | 4 |
| CMAC_A | 0x7000 | 0x7FFF | 4 |
| CMAC_B | 0x8000 | 0x8FFF | 4 |
| CACC | 0x9000 | 0x9FFF | 4 |
| SDP (RDMA) | 0xA000 | 0xAFFF | 4 |
| SDP | 0xB000 | 0xBFFF | 4 |
| PDP (RDMA) | 0xC000 | 0xCFFF | 4 |
| PDP | 0xD000 | 0xDFFF | 4 |
| CDP (RDMA) | 0xE000 | 0xEFFF | 4 |
| CDP | 0xF000 | 0xFFFF | 4 |
| RUBIK | 0x10000 | 0x10FFF | 4 |
| Reserved | 0x11000 | 0x3FFFF | 188 |
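For convenience, the same map can be expressed as C constants; the macro names are illustrative, and the offsets follow directly from the 4 KiB-per-subunit layout above:

```c
/* Subunit base addresses within NVDLA's 256 KiB CSB space (4 KiB each). */
#define NVDLA_GLB_BASE       0x00000
#define NVDLA_MCIF_BASE      0x02000
#define NVDLA_SRAMIF_BASE    0x03000
#define NVDLA_BDMA_BASE      0x04000
#define NVDLA_CDMA_BASE      0x05000
#define NVDLA_CSC_BASE       0x06000
#define NVDLA_CMAC_A_BASE    0x07000
#define NVDLA_CMAC_B_BASE    0x08000
#define NVDLA_CACC_BASE      0x09000
#define NVDLA_SDP_RDMA_BASE  0x0A000
#define NVDLA_SDP_BASE       0x0B000
#define NVDLA_PDP_RDMA_BASE  0x0C000
#define NVDLA_PDP_BASE       0x0D000
#define NVDLA_CDP_RDMA_BASE  0x0E000
#define NVDLA_CDP_BASE       0x0F000
#define NVDLA_RUBIK_BASE     0x10000
```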
| Name | Address | Description |
|---|---|---|
|  |  | HW version of NVDLA |
|  |  | Interrupt mask control |
|  |  | Interrupt set control |
|  |  | Interrupt status |
| Name | Address | Description |
|---|---|---|
|  |  | Register0 to control the read weight of clients in MCIF |
|  |  | Register1 to control the read weight of clients in MCIF |
|  |  | Register2 to control the read weight of clients in MCIF |
|  |  | Register0 to control the write weight of clients in MCIF |
|  |  | Register1 to control the write weight of clients in MCIF |
|  |  | Outstanding AXI transactions in unit of 64Byte |
|  |  | Idle status of MCIF |
| Name | Address | Description |
|---|---|---|
|  |  | Register0 to control the read weight of clients in SRAMIF |
|  |  | Register1 to control the read weight of clients in SRAMIF |
|  |  | Register2 to control the read weight of clients in SRAMIF |
|  |  | Register0 to control the write weight of clients in SRAMIF |
|  |  | Register1 to control the write weight of clients in SRAMIF |
|  |  | Outstanding AXI transactions in unit of 64Byte |
|  |  | Idle status of SRAMIF |
| Name | Address | Description |
|---|---|---|
|  |  | Lower 32bits of source address |
|  |  | Higher 32bits of source address when axi araddr is 64bits |
|  |  | Lower 32bits of dest address |
|  |  | Higher 32bits of dest address when axi awaddr is 64bits |
|  |  | Size of one line |
|  |  | Ram type of source and destination |
|  |  | Number of lines to be moved in one surface |
|  |  | Source line stride |
|  |  | Destination line stride |
|  |  | Number of surfaces to be moved in one operation |
|  |  | Source surface stride |
|  |  | Destination surface stride |
|  |  | This register is not used in NVDLA 1.0 |
|  |  | Set it to 1 to kick off operations in group0 |
|  |  | Set it to 1 to kick off operations in group1 |
|  |  | Enable/Disable of counting stalls |
|  |  | Status register: idle status of bdma, group0 and group1 |
|  |  | Counting register of group0 read stall |
|  |  | Counting register of group0 write stall |
|  |  | Counting register of group1 read stall |
|  |  | Counting register of group1 write stall |
| Name | Address | Description |
|---|---|---|
|  |  | Idle status of two register groups |
|  |  | Pointer for CSB master and data path to access groups |
|  |  | WMB and Weight share same port to access external memory. This register controls the weight factor in the arbiter. |
|  |  | Indicates whether CBUF flush is finished after reset. |
|  |  | Set it to 1 to kick off operation for current register group |
|  |  | Configuration of operation: convolution mode, precision, weight reuse, data reuse. |
|  |  | Input data format and pixel format |
|  |  | Input cube’s width and height |
|  |  | Input cube’s channel |
|  |  | Input cube’s width and height after extension |
|  |  | For image-in mode, horizontal offset and vertical offset of the 1st pixel. |
|  |  | Ram type of input RAM |
|  |  | Higher 32bits of input data address when axi araddr is 64bits |
|  |  | Lower 32bits of input data address |
|  |  | Higher 32bits of input data address of UV plane when axi araddr is 64bits |
|  |  | Lower 32bits of input data address of UV plane |
|  |  | Line stride of input cube |
|  |  | Line stride of input cube’s UV plane |
|  |  | Surface stride of input cube |
|  |  | Whether input cube is line packed or surface packed |
|  |  | This address is reserved |
|  |  | This address is reserved |
|  |  | Number of batches |
|  |  | The stride of input data cubes when batches > 1 |
|  |  | Number of CBUF entries used for one input slice |
|  |  | Number of slices to be fetched before sending update information to CSC |
|  |  | Whether weight is compressed or not |
|  |  | The size of one kernel in bytes |
|  |  | Number of kernels |
|  |  | Ram type of weight |
|  |  | Higher 32bits of weight address when axi araddr is 64bits |
|  |  | Lower 32bits of weight address |
|  |  | Total bytes of Weight |
|  |  | Higher 32bits of wgs address when axi araddr is 64bits |
|  |  | Lower 32bits of wgs address |
|  |  | Higher 32bits of wmb address when axi araddr is 64bits |
|  |  | Lower 32bits of wmb address |
|  |  | Total bytes of WMB |
|  |  | Whether mean registers are used or not |
|  |  | Global mean value for red in RGB or Y in YUV; global mean value for green in RGB or U in YUV |
|  |  | Global mean value for blue in RGB or V in YUV; global mean value for alpha in ARGB/AYUV or X in XRGB |
|  |  | Enable/disable input data converter in CDMA and number of bits to be truncated in the input data converter |
|  |  | Offset of input data convertor |
|  |  | Scale of input data convertor |
|  |  | Convolution x stride and convolution y stride |
|  |  | Left/right/top/bottom padding size |
|  |  | Padding value |
|  |  | Number of data banks and weight banks in CBUF |
|  |  | Enable/Disable flushing input NaN to zero |
|  |  | Count NaN number in input data cube, update per layer |
|  |  | Count NaN number in weight kernels, update per layer |
|  |  | Count infinity number in input data cube, update per layer |
|  |  | Count infinity number in weight kernels, update per layer |
|  |  | Enable/disable performance counter |
|  |  | Count blocking cycles of read request of input data, update per layer |
|  |  | Count blocking cycles of read request of weight data, update per layer |
|  |  | Count total latency cycles of read response of input data, update per layer |
|  |  | Count total latency cycles of read request of weight data, update per layer |
Note that some registers in the CDMA unit are only used in certain modes; if these modes are not shown as available in the hardware capability registers, their registers are not available either. These registers are as noted below:
| Feature | Registers |
|---|---|
| Image-in mode |  |
| FP16 data format |  |
| Weight compression |  |
| Name | Address | Description |
|---|---|---|
|  |  | Idle status of two register groups |
|  |  | Pointer for CSB master and data path to access groups |
|  |  | Set it to 1 to kick off operation for current register group |
|  |  | Configuration of operation: convolution mode, precision, weight reuse, data reuse. |
|  |  | Input data format and pixel format |
|  |  | Input cube’s width and height after extension |
|  |  | Input cube’s channel after extension |
|  |  | Number of batches |
|  |  | Post extension parameter for image-in |
|  |  | Number of CBUF entries used for one input slice |
|  |  | Whether weight is compressed or not |
|  |  | Weight’s width and height after extension |
|  |  | Weight’s channel after extension and number of weight kernels |
|  |  | Total bytes of Weight |
|  |  | Total bytes of WMB |
|  |  | Output cube’s width and height |
|  |  | Output cube’s channel |
|  |  | Equals to output_data_cube_width * output_data_cube_height - 1 |
|  |  | Slices of CBUF to be released at the end of current layer |
|  |  | Convolution x stride and convolution y stride after extension |
|  |  | Dilation parameter |
|  |  | Left/right/top/bottom padding size |
|  |  | Padding value |
|  |  | Number of data banks and weight banks in CBUF |
|  |  | PRA truncate in Winograd mode, range: 0~2 |
Note that some registers in the CSC unit are only used in certain modes; if these modes are not shown as available in the hardware capability registers, their registers are not available either. These registers are as noted below:
| Feature | Registers |
|---|---|
| Image-in mode |  |
| Weight compression |  |
| Name | Address | Description |
|---|---|---|
|  |  | Idle status of two register groups |
|  |  | Pointer for CSB master and data path to access groups |
|  |  | Set it to 1 to kick off operation for current register group |
|  |  | Configuration of operation: convolution mode, precision, etc. |
| Name | Address | Description |
|---|---|---|
|  |  | Idle status of two register groups |
|  |  | Pointer for CSB master and data path to access groups |
|  |  | Set it to 1 to kick off operation for current register group |
|  |  | Configuration of operation: convolution mode, precision, etc. |
| Name | Address | Description |
|---|---|---|
|  |  | Idle status of two register groups |
|  |  | Pointer for CSB master and data path to access groups |
|  |  | Set it to 1 to kick off operation for current register group |
|  |  | Configuration of operation: convolution mode, precision, etc. |
|  |  | Input cube’s width and height after extension |
|  |  | Input cube’s channel after extension |
|  |  | Address of output cube |
|  |  | Number of batches |
|  |  | Line stride of output cube |
|  |  | Line stride of surface cube |
|  |  | Whether output cube is line packed or surface packed |
|  |  | Number of bits to be truncated before sending to SDP |
|  |  | Output saturation count |
| Name | Address | Description |
|---|---|---|
|  |  | Idle status of two register groups |
|  |  | Pointer for CSB master and data path to access groups |
|  |  | Set it to 1 to kick off operation for current register group |
|  |  | Input cube’s width |
|  |  | Input cube’s height |
|  |  | Input cube’s channel |
|  |  | Lower 32bits of input data address |
|  |  | Higher 32bits of input data address when axi araddr is 64bits |
|  |  | Line stride of input cube |
|  |  | Surface stride of input cube |
|  |  | Configuration of BRDMA: enable/disable, data size, Ram type, etc. |
|  |  | Lower 32bits address of the bias data cube |
|  |  | Higher 32bits address of the bias data cube when axi araddr is 64bits |
|  |  | Line stride of bias data cube |
|  |  | Surface stride of bias data cube |
|  |  | Stride of bias data cube in batch mode |
|  |  | Configuration of NRDMA: enable/disable, data size, Ram type, etc. |
|  |  | Lower 32bits address of the bias data cube |
|  |  | Higher 32bits address of the bias data cube when axi araddr is 64bits |
|  |  | Line stride of bias data cube |
|  |  | Surface stride of bias data cube |
|  |  | Stride of bias data cube in batch mode |
|  |  | Configuration of ERDMA: enable/disable, data size, Ram type, etc. |
|  |  | Lower 32bits address of the bias data cube |
|  |  | Higher 32bits address of the bias data cube when axi araddr is 64bits |
|  |  | Line stride of bias data cube |
|  |  | Surface stride of bias data cube |
|  |  | Stride of bias data cube in batch mode |
|  |  | Operation configuration: flying mode, output destination, Direct or Winograd mode, flush NaN to zero, batch number. |
|  |  | RAM type of input data cube |
|  |  | Input NaN element number |
|  |  | Input Infinity element number |
|  |  | Enable/Disable performance counting |
|  |  | Count stall cycles of M read DMA for one layer |
|  |  | Count stall cycles of B read DMA for one layer |
|  |  | Count stall cycles of N read DMA for one layer |
|  |  | Count stall cycles of E read DMA for one layer |
Name | Address | Description
---|---|---
 | | Idle status of two register groups
 | | Pointer for CSB master and data path to access groups
 | | LUT access address and type
 | | Data register for LUT read or write
 | | LUT type (exponent or linear) and the selection between LE and LO tables
 | | LE and LO LUT index offset and selection
 | | Start of LE LUT’s range
 | | End of LE LUT’s range
 | | Start of LO LUT’s range
 | | End of LO LUT’s range
 | | Slope scale parameter for LE LUT underflow and overflow, signed value
 | | Slope shift parameter for LE LUT underflow and overflow, signed value
 | | Slope scale parameter for LO LUT underflow and overflow, signed value
 | | Slope shift parameter for LO LUT underflow and overflow, signed value
 | | Set it to 1 to kick off operation for current register group
 | | Input cube’s width
 | | Input cube’s height
 | | Input cube’s channel
 | | Lower 32bits of output data address
 | | Higher 32bits of output data address when axi awaddr is 64bits
 | | Line stride of output data cube
 | | Surface stride of output data cube
 | | Configuration of BS module: bypass, algorithm, etc.
 | | Source type and shifter value of BS ALU
 | | Operand value of BS ALU
 | | Source type and shifter value of BS MUL
 | | Operand value of BS MUL
 | | Configuration of BN module: bypass, algorithm, etc.
 | | Source type and shifter value of BN ALU
 | | Operand value of BN ALU
 | | Source type and shifter value of BN MUL
 | | Operand value of BN MUL
 | | Configuration of EW module: bypass, algorithm, etc.
 | | Source type and bypass control of EW ALU
 | | Operand value of EW ALU
 | | Converter offset of EW ALU
 | | Converter scale of EW ALU
 | | Converter truncate of EW ALU
 | | Source type and bypass control of EW MUL
 | | Operand value of EW MUL
 | | Converter offset of EW MUL
 | | Converter scale of EW MUL
 | | Converter truncate of EW MUL
 | | Truncate of EW
 | | Operation configuration: flying mode, output destination, Direct or Winograd mode, flush NaN to zero, batch number
 | | Destination RAM type
 | | Stride of output cubes in batch mode
 | | Data precision
 | | Output converter offset
 | | Output converter scale
 | | Output converter shifter value
 | | Output of equal mode
 | | Input NaN element number
 | | Input Infinity element number
 | | Output NaN element number
 | | Enable/Disable performance counting
 | | Count stall cycles of write DMA for one layer
 | | Number of elements that underflow both tables
 | | Number of elements that overflow both tables
 | | Number of elements that saturate both tables
 | | Number of elements that hit both tables, or miss both by underflowing one table while overflowing the other
 | | Number of elements that hit only the LE table
 | | Number of elements that hit only the LO table
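The LUT access pair at the top of this table implies an indirect load path: one register selects the table, entry address, and transfer direction, and the other moves the entries themselves. A sketch assuming the entry index auto-increments after each data write, which the table does not state:

```c
#include <stdint.h>

extern void reg_write(uint32_t offset, uint32_t value);

/* Hypothetical offsets for the LUT access pair. */
#define S_LUT_ACCESS_CFG  0x08
#define S_LUT_ACCESS_DATA 0x0c

/* Load one table. access_cfg is assumed to carry the LE/LO table select,
 * the starting entry address, and a write flag; the entry index is assumed
 * to advance automatically after each data write. */
static void load_lut(uint32_t access_cfg, const uint16_t *entries,
                     unsigned count)
{
    reg_write(S_LUT_ACCESS_CFG, access_cfg);
    for (unsigned i = 0; i < count; i++)
        reg_write(S_LUT_ACCESS_DATA, entries[i]);
}
```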
Name | Address | Description
---|---|---
 | | Idle status of two register groups
 | | Pointer for CSB master and data path to access groups
 | | Set it to 1 to kick off operation for current register group
 | | Input data cube’s width
 | | Input data cube’s height
 | | Input data cube’s channel
 | | Indicates whether the source is SDP or external memory
 | | Lower 32bits of input data address
 | | Higher 32bits of input data address when axi araddr is 64bits
 | | Line stride of input cube
 | | Surface stride of input cube
 | | RAM type of input data cube
 | | Input data format
 | | Split number
 | | Kernel width and kernel stride
 | | Padding width
 | | Partial width for first, last and middle partitions
 | | Enable/Disable performance counting
 | | Counting stalls of read requests
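The line- and surface-stride registers describe how the input cube sits in memory. Assuming the packed cube layout used by NVDLA, in which data is stored as 32-byte channel atoms, both strides follow from the cube’s width and height:

```c
#include <stdint.h>

/* Strides for a fully packed cube under the assumed 32-byte atom layout:
 * each line holds W atoms, each surface holds H lines. */
#define ATOM_BYTES 32u

static void packed_strides(uint32_t width, uint32_t height,
                           uint32_t *line_stride, uint32_t *surf_stride)
{
    *line_stride = width * ATOM_BYTES;
    *surf_stride = *line_stride * height;
}
```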
Name | Address | Description
---|---|---
 | | Idle status of two register groups
 | | Pointer for CSB master and data path to access groups
 | | Set it to 1 to kick off operation for current register group
 | | Input data cube’s width
 | | Input data cube’s height
 | | Input data cube’s channel
 | | Output data cube’s width
 | | Output data cube’s height
 | | Output data cube’s channel
 | | Split number
 | | Option to flush input NaN to zero
 | | Partial width for first, last and middle partitions of input cube
 | | Partial width for first, last and middle partitions of output cube
 | | Kernel width and kernel stride
 | | Reciprocal of pooling kernel width; set to (actual value) * 2^16 when INT8/INT16 format is enabled, and to the actual value for FP16 precision mode with FP17 data format
 | | Reciprocal of pooling kernel height; set to (actual value) * 2^16 when INT8/INT16 format is enabled, and to the actual value for FP16 precision mode with FP17 data format
 | | Left/right/top/bottom padding size
 | | Padding value * 1
 | | Padding value * 2
 | | Padding value * 3
 | | Padding value * 4
 | | Padding value * 5
 | | Padding value * 6
 | | Padding value * 7
 | | Lower 32bits of input data address
 | | Higher 32bits of input data address when axi araddr is 64bits
 | | Line stride of input cube
 | | Surface stride of input cube
 | | Lower 32bits of output data address
 | | Higher 32bits of output data address when axi awaddr is 64bits
 | | Line stride of output cube
 | | Surface stride of output cube
 | | RAM type of destination cube
 | | Precision of input data
 | | Input Infinity element number
 | | Input NaN element number
 | | Output NaN element number
 | | Enable/Disable performance counting
 | | Counting stalls of write requests
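The two reciprocal registers hold fixed-point values rather than the kernel dimensions themselves: for INT8/INT16 data the register carries (1 / kernel size) scaled by 2^16. A small helper showing the arithmetic; truncation toward zero is an assumption, since the table does not specify the rounding:

```c
#include <stdint.h>

/* Fixed-point reciprocal for the INT8/INT16 case: (1 / kernel) * 2^16. */
static uint32_t pooling_recip_fixed(uint32_t kernel_size)
{
    return (1u << 16) / kernel_size;  /* e.g. kernel 3 -> 21845 */
}
```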
Name | Address | Description
---|---|---
 | | Idle status of two register groups
 | | Pointer for CSB master and data path to access groups
 | | Set it to 1 to kick off operation for current register group
 | | Input data cube’s width
 | | Input data cube’s height
 | | Input data cube’s channel
 | | Lower 32bits of input data address
 | | Higher 32bits of input data address when axi araddr is 64bits
 | | Line stride of input cube
 | | Surface stride of input cube
 | | RAM type of input data cube
 | | This register is not used in OpenDLA 1.0
 | | Split number
 | | Input data format
 | | Enable/Disable performance counting
 | | Counting stalls of read requests
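The performance rows follow the same pattern as in the other blocks: counting is armed before the layer is kicked off and the stall counter is read back after the completion interrupt. A sketch with hypothetical register names:

```c
#include <stdint.h>

extern uint32_t reg_read(uint32_t offset);
extern void     reg_write(uint32_t offset, uint32_t value);

/* Hypothetical offsets for the two performance rows of this table. */
#define D_PERF_ENABLE     0x40
#define D_PERF_READ_STALL 0x44

static uint32_t measure_read_stalls(void (*run_layer)(void))
{
    reg_write(D_PERF_ENABLE, 1);        /* arm counting before kick-off */
    run_layer();                        /* submit layer, wait for IRQ   */
    return reg_read(D_PERF_READ_STALL); /* stalls for that one layer    */
}
```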
Name | Address | Description
---|---|---
 | | Idle status of two register groups
 | | Pointer for CSB master and data path to access groups
 | | LUT access address and type
 | | Data register for LUT read or write
 | | LUT type (exponent or linear) and the selection between LE and LO tables
 | | LE and LO LUT index offset and selection
 | | Lower 32bits of start of LE LUT’s range
 | | Higher 6bits of start of LE LUT’s range
 | | Lower 32bits of end of LE LUT’s range
 | | Higher 6bits of end of LE LUT’s range
 | | Lower 32bits of start of LO LUT’s range
 | | Higher 6bits of start of LO LUT’s range
 | | Lower 32bits of end of LO LUT’s range
 | | Higher 6bits of end of LO LUT’s range
 | | Slope scale parameter for LE LUT underflow and overflow, signed value
 | | Slope shift parameter for LE LUT underflow and overflow, signed value
 | | Slope scale parameter for LO LUT underflow and overflow, signed value
 | | Slope shift parameter for LO LUT underflow and overflow, signed value
 | | Set it to 1 to kick off operation for current register group
 | | Square-sum process bypass control and multiplier-after-interpolator bypass control
 | | Lower 32bits of output data address
 | | Higher 32bits of output data address when axi awaddr is 64bits
 | | Line stride of output cube
 | | Surface stride of output cube
 | | RAM type of output data cube
 | | This register is not used in OpenDLA 1.0
 | | Precision of input data
 | | Option to flush input NaN to zero
 | | Normalization length
 | | Input data convertor offset
 | | Input data convertor scale
 | | Input data convertor shifter value
 | | Output data convertor offset
 | | Output data convertor scale
 | | Output data convertor shifter value
 | | Input NaN element number
 | | Input Infinity element number
 | | Output NaN element number
 | | Saturated element number
 | | Enable/Disable performance counting
 | | Count stall cycles of write DMA for one layer
 | | Number of elements that underflow both LUTs
 | | Number of elements that overflow both LUTs
 | | Number of elements that miss both LUTs by underflowing one and overflowing the other
 | | Number of elements that hit only the LE LUT
 | | Number of elements that hit only the LO LUT
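The input and output convertor triplets (offset, scale, shifter) suggest a fixed-point conversion of the form y = saturate(((x - offset) * scale) >> shifter); the exact rounding and saturation bounds are assumptions here, not stated in the table:

```c
#include <stdint.h>

/* Assumed convertor form; an arithmetic right shift is assumed for
 * negative intermediates, and [lo, hi] are the saturation bounds of
 * the target precision. */
static int32_t convert(int64_t x, int32_t offset, int32_t scale,
                       uint32_t shifter, int32_t lo, int32_t hi)
{
    int64_t y = ((x - offset) * (int64_t)scale) >> shifter;
    return (int32_t)(y < lo ? lo : (y > hi ? hi : y));
}
```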
Name | Address | Description
---|---|---
 | | Idle status of two register groups
 | | Pointer for CSB master and data path to access groups
 | | Set it to 1 to kick off operation for current register group
 | | Operation mode and precision
 | | RAM type of input cube
 | | Input data cube’s width and height
 | | Input data cube’s channel
 | | Higher 32bits of input data address when axi araddr is 64bits
 | | Lower 32bits of input data address
 | | Line stride of input data cube
 | | Surface stride of input data cube
 | | Input data planar stride, for merge mode only
 | | RAM type of output cube
 | | Output data cube’s channel
 | | Higher 32bits of output data address when axi awaddr is 64bits
 | | Lower 32bits of output data address
 | | Line stride of output data cube
 | | Input stride for each X step; equals (DATAOUT_CHANNEL+1) * BPE / 32 * DAIN_SURF_STRIDE, where BPE = (IN_PRECISION == INT8) ? 1 : 2
 | | Output stride corresponding to each line in the input cube; equals (DECONV_Y_STRIDE+1) * DAOUT_LINE_STRIDE
 | | Surface stride of output data cube
 | | Output data planar stride, for split mode only
 | | Deconvolution x stride and y stride
 | | Enable/Disable performance counting
 | | Count stall cycles of read DMA for one layer
 | | Count stall cycles of write DMA for one layer
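The two stride formulas quoted in this table transcribe directly into code; the +1 terms suggest the register fields hold their values minus one:

```c
#include <stdint.h>

/* Input x-step stride: (DATAOUT_CHANNEL+1) * BPE / 32 * DAIN_SURF_STRIDE,
 * with BPE = 1 for INT8 and 2 otherwise, as given in the table. */
static uint32_t rubik_dain_x_stride(uint32_t dataout_channel,
                                    int in_precision_is_int8,
                                    uint32_t dain_surf_stride)
{
    uint32_t bpe = in_precision_is_int8 ? 1u : 2u;
    return (dataout_channel + 1) * bpe / 32 * dain_surf_stride;
}

/* Output stride per input line: (DECONV_Y_STRIDE+1) * DAOUT_LINE_STRIDE. */
static uint32_t rubik_daout_y_stride(uint32_t deconv_y_stride,
                                     uint32_t daout_line_stride)
{
    return (deconv_y_stride + 1) * daout_line_stride;
}
```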