NVDLA Primer ************ Abstract ======== The majority of compute effort for Deep Learning inference is based on mathematical operations that can mostly be grouped into four parts: convolutions; activations; pooling; and normalization. These operations share a few characteristics that make them particularly well suited for special-purpose hardware implementation: their memory access patterns are extremely predictable, and they are readily parallelized. The NVIDIA® Deep Learning Accelerator (NVDLA) project promotes a standardized, open architecture to address the computational demands of inference. The NVDLA architecture is both scalable and highly configurable; the modular design maintains flexibility and simplifies integration. Standardizing Deep Learning acceleration promotes interoperability with the majority of modern Deep Learning networks and contributes to a unified growth of machine learning at scale. NVDLA hardware provides a simple, flexible, robust inference acceleration solution. It supports a wide range of performance levels and readily scales for applications ranging from smaller, cost-sensitive Internet of Things (IoT) devices to larger performance oriented IoT devices. NVDLA is provided as a set of IP-core models based on open industry standards: the Verilog model is a synthesis and simulation model in RTL form, and the TLM SystemC simulation model can be used for software development, system integration, and testing. The NVDLA software ecosystem includes an on-device software stack (part of the open source release), a full training infrastructure to build new models that incorporate Deep Learning, and parsers that convert existing models to a form that is usable by the on-device software. The open source NVDLA project is managed as an open, directed community. NVIDIA welcomes contributions to NVDLA, and will maintain an open process for external users and developers who wish to submit changes back. Contributors are expected to agree to a Contributor License Agreement, ensuring that any IP rights from a contributor are granted to all NVDLA users; users who do not wish to contribute back to NVDLA are under no obligation to do so. After the initial release, development will take place in the open. NVDLA software, hardware, and documentation will be made available through GitHub. NVDLA hardware and software are available under the :doc:`NVIDIA Open NVDLA License `, which is a permissive license that includes a FRAND-RF patent grant. Additionally, for users who build “NVDLA-compatible” implementations which interact well with the greater NVDLA ecosystem, NVIDIA may grant the right to use the “NVDLA” name, or other NVIDIA trademarks. (This licensing description is meant to be informative, not normative; where this information conflicts with the NVDLA license, the NVDLA license supersedes.) Accelerating Deep Learning Inference using NVDLA ================================================ NVDLA introduces a modular architecture designed to simplify configuration, integration and portability; it exposes the building blocks used to accelerate core Deep Learning inference operations. NVDLA hardware is comprised of the following components: * Convolution Core – optimized high-performance convolution engine. * Single Data Processor – single-point lookup engine for activation functions. * Planar Data Processor – planar averaging engine for pooling. * Channel Data Processor – multi-channel averaging engine for advanced normalization functions. 
* Dedicated Memory and Data Reshape Engines – memory-to-memory transformation acceleration for tensor reshape and copy operations. Each of these blocks is separate and independently configurable. A system that has no need for pooling, for instance, can remove the planar averaging engine entirely; or, a system that needs additional convolutional performance can scale up the performance of the convolution unit without modifying other units in the accelerator. Scheduling operations for each unit are delegated to a co-processor or CPU; they operate on extremely fine-grained scheduling boundaries, with each unit operating independently. This requirement for closely-managed scheduling can be made part of the NVDLA sub-system with the addition of a dedicated management coprocessor (“headed” implementation), or this functionality can be fused with the higher-level driver implementation on the main system processor (“headless” implementation). This enables the same NVDLA hardware architecture to serve a variety of implementation sizes. NVDLA hardware uses standard practices to interface with the rest of the system: a control channel implements a register file and interrupt interface, and a pair of standard AXI bus interfaces are used to interface with memory. The primary memory interface is intended to connect to the system’s wider memory system, including system DRAM; this memory interface is expected to be shared with the system’s CPU and I/O peripherals. The second memory interface is optional, and allows for a connection to higher-bandwidth memory that may be dedicated to NVDLA or to a computer vision subsystem in general. This option for a heterogeneous memory interface enables additional flexibility for scaling between different types of host systems. The typical flow for inferencing begins with the NVDLA management processor (either a microcontroller in a “headed” implementation, or the main CPU in a “headless” implementation) sending down the configuration of one hardware layer, along with an “activate” command. If data dependencies do not preclude this, multiple hardware layers can be sent down to different engines and activated at the same time (i.e., if there exists another layer whose inputs do not depend on the output from the previous layer). Because every engine has a double-buffer for its configuration registers, it can also capture a second layer’s configuration and begin processing immediately when the active layer has completed. Once a hardware engine finishes its active task, it will issue an interrupt to the management processor to report the completion, and the management processor will then begin the process again. This command-execute-interrupt flow repeats until inference on the entire network is complete. NVDLA implementations generally fall into two categories: * Headless – unit-by-unit management of the NVDLA hardware happens on the main system processor. * Headed – delegates the high-interrupt-frequency tasks to a companion microcontroller that is tightly coupled to the NVDLA sub-system. .. note:: The initial open source release of NVDLA will provide only a software solution for “headless” mode, with “headed” mode drivers to come at a later time. The Small system model in :numref:`fig_system_comparison`, below, shows an example of a headless NVDLA implementation, while the Large system model shows a headed implementation. The Small model represents an NVDLA implementation for a more cost-sensitive, purpose-built device. 
The Large system model is characterized by the addition of a dedicated control coprocessor and high-bandwidth SRAM to support the NVDLA sub-system. The Large system model is geared more toward high-performance IoT devices that may run many tasks at once. .. _fig_system_comparison: .. figure:: nvdla-primer-system-comparison.svg :alt: "Small" and "Large" NVDLA systems side by side, with SRAMIF disconnected on "small" system, and a microcontroller on "large" system. :align: center Comparison of two possible NVDLA systems. Small NVDLA Model ----------------- The small-NVDLA model opens up Deep Learning technology in areas where it was previously not feasible. This model is a good fit for cost-sensitive connected Internet of Things (IoT) class devices, and for AI- and automation-oriented systems that have well-defined tasks for which cost, area, and power are the primary drivers. Savings (in terms of cost, area, and power) are achieved through NVDLA's configurable resources. Neural network models can be pre-compiled and performance-optimized, allowing larger models to be “cut down” and reduced in load complexity; this, in turn, enables a scaled-down NVDLA implementation where models consume less storage and take less time for system software to load and process. These purpose-built systems typically execute only one task at a time, and as such, sacrificing system performance while NVDLA is operating is generally not a strong concern. The relatively inexpensive context switches associated with these systems (sometimes a result of processor architectural choices, and sometimes a result of using a system like FreeRTOS for task management) mean that the main processor is not overly burdened by servicing a large number of NVDLA interrupts. This removes the need for an additional microcontroller, and the main processor performs both the coarse-grained scheduling and memory allocation, as well as the fine-grained NVDLA management. Typically, systems following the small-NVDLA model will not include the optional second memory interface. When overall system performance is less of a priority, the impact of not having a high-speed memory path is unlikely to be critical. In such systems, the system memory (usually DRAM) is likely to consume less power than an SRAM, making it more power-efficient to use the system memory as a computation cache. Large NVDLA Model ----------------- The large-NVDLA model serves as a better choice when the primary emphasis is on high performance and versatility. Performance-oriented IoT systems may perform inference on many different network topologies; as a result, it is important that these systems maintain a high degree of flexibility. Additionally, these systems may be performing many tasks at once, rather than serializing inference operations, so inference operations must not consume too much processing power on the host. To address these needs, the NVDLA hardware includes a second (optional) memory interface for a dedicated high-bandwidth SRAM, and supports interfacing with a dedicated control coprocessor (microcontroller) to limit the interrupt load on the main processor. When included in the implementation, a high-bandwidth SRAM is connected to a fast-memory bus interface port on NVDLA. This SRAM is used as a cache by NVDLA; optionally, it may be shared by other high-performance computer-vision-related components on the system to further reduce traffic to the main system memory (Sys DRAM). 
Requirements for the NVDLA coprocessor are fairly typical; as such, there are many general-purpose processors that would be appropriate (e.g., RISC-V-based PicoRV32 processors, ARM Cortex-M or Cortex-R processors, or even in-house microcontroller designs). When using a dedicated coprocessor, the host processor still handles some tasks associated with managing NVDLA. For instance, although the coprocessor becomes responsible for scheduling and fine-grained programming of the NVDLA hardware, the host will remain responsible for coarse-grained scheduling on the NVDLA hardware, for IOMMU mapping of NVDLA memory access (as necessary), for memory allocation of input data and fixed weight arrays on NVDLA, and for synchronization between other system components and tasks that run on NVDLA. Hardware Architecture ===================== The NVDLA architecture can be programmed in two modes of operation: independent mode, and fused mode. * **Independent.** When operating independently, each functional block is configured for when and what it executes, with each block working on its assigned task (akin to independent layers in a Deep Learning framework). Independent operation begins and ends with the assigned block performing memory-to-memory operations, in and out of main system memory or dedicated SRAM memory. * **Fused.** Fused operation is similar to independent operation; however, some blocks can be assembled as a pipeline. This improves performance by bypassing the round trip through memory, instead having blocks communicate with each other through small FIFOs (i.e., the convolution core can pass data to the Single Data Point Processor, which can pass data to the Planar Data Processor, and in turn to the Cross-channel Data Processor). .. _fig_core_diagram: .. figure:: nvdla-primer-core-diagram.svg :alt: "Headless NVDLA core" architectural drawing. A configuration interface block is connected to the outside world through the CSB/interrupt interface. The memory interface block is connected outside with a DBB interface and a second, optional, DBB interface. The memory interface connects to a convolution buffer, which connects to a convolution core; the memory interface also connects to the activation engine, the pooling engine, local response normalization engine, reshape engine, and bridge DMA engine. The convolution core, activation engine, pooling engine, and local response normalization engine also form a pipeline. :align: center Internal architecture of NVDLA core. Connections ----------- NVDLA implements three major connections to the rest of the system: * **Configuration Space Bus (CSB) interface.** This interface is a synchronous, low-bandwidth, low-power, 32-bit control bus designed to be used by a CPU to access the NVDLA configuration registers. NVDLA functions as a slave on the CSB interface. CSB implements a very simple interface protocol so it can be easily converted to AMBA, OCP, or any other system bus with a simple shim layer. * **Interrupt interface.** NVDLA hardware includes a 1-bit level-driven interrupt. The interrupt line is asserted when a task has been completed or when an error occurs. * **Data Backbone (DBB) interface.** The DBB interface connects NVDLA to the main system memory subsystem. It is a synchronous, high-speed, and highly configurable data bus. It can be specified to have different address sizes, different data sizes, and to issue different sizes of requests depending upon the requirements of the system. 
The data backbone interface is a simple interface protocol that is similar to AXI (and can be readily used in AXI-compliant systems). The DBB interface has an optional second connection which can be used when there is a second memory path available. This connection is identical in design to the primary DBB interface and is intended for use with an on-chip SRAM that can provide higher throughput and lower access latency. The second DBB interface is not necessary for NVDLA to function; systems that do not require this memory interface can save area by removing it. Components ---------- Each component in the NVDLA architecture exists to support specific operations integral to inference on deep neural networks. The following descriptions provide a brief functional overview of each block, including the TensorFlow operations that map onto them. While TensorFlow operations are provided as examples, NVDLA hardware supports other Deep Learning frameworks as well. Convolution ^^^^^^^^^^^ Convolution operations work on two sets of data: one set of offline-trained “weights” (which remain constant between each run of inference), and one set of input “feature” data (which varies with the network’s input). The convolutional engine exposes parameters to map many different sizes of convolutions onto the hardware with high efficiency. The NVDLA convolution engine includes optimizations to improve performance over a naive convolution implementation. Support for sparse weight compression saves memory bandwidth. Built-in Winograd convolution support improves compute efficiency for certain sizes of filters. Batch convolution can save additional memory bandwidth by reusing weights when running multiple inferences in parallel. To avoid repeated accesses to system memory, the NVDLA convolution engine has an internal RAM reserved for weight and input feature storage, referred to as the “convolution buffer”. This design greatly improves memory efficiency over issuing a request to the system memory controller each time a weight or feature is needed. The convolution unit maps onto TensorFlow operations such as ``tf.nn.conv2d``. Single Data Point Processor ^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Single Data Point Processor (SDP) allows for the application of both linear and non-linear functions onto individual data points. This is commonly used immediately after convolution in CNN systems. The SDP has a lookup table to implement non-linear functions; for linear functions, it supports simple bias and scaling. This combination supports most common activation functions and other element-wise operations, including ReLU, PReLU, precision scaling, batch normalization, and bias addition, as well as more complex non-linear functions such as sigmoid or hyperbolic tangent. The SDP maps onto TensorFlow operations including ``tf.nn.batch_normalization``, ``tf.nn.bias_add``, ``tf.nn.elu``, ``tf.nn.relu``, ``tf.sigmoid``, ``tf.tanh``, and more. Planar Data Processor ^^^^^^^^^^^^^^^^^^^^^ The Planar Data Processor (PDP) supports specific spatial operations that are common in CNN applications. It is configurable at runtime to support different pool group sizes, and supports three pooling functions: maximum-pooling, minimum-pooling, and average-pooling. The PDP maps onto the ``tf.nn.avg_pool``, ``tf.nn.max_pool``, and ``tf.nn.pool`` operations. 
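To make the mapping from framework operations to NVDLA blocks concrete, the short TensorFlow fragment below sketches a typical convolution / activation / pooling sequence: the ``tf.nn.conv2d`` call corresponds to the Convolution core, the bias addition and ReLU to the SDP, and the pooling to the PDP. This is an illustrative sketch only (it assumes TensorFlow 2.x, and the tensor shapes are arbitrary placeholders); it is not a statement about how the NVDLA software stack ingests models.

.. code-block:: python

   # Illustrative mapping of TensorFlow ops onto NVDLA functional blocks.
   # Shapes and values are arbitrary placeholders.
   import tensorflow as tf

   x = tf.random.normal([1, 56, 56, 64])    # NHWC input feature map
   w = tf.random.normal([3, 3, 64, 128])    # 3x3 convolution weights
   b = tf.zeros([128])                      # per-channel bias

   y = tf.nn.conv2d(x, w, strides=1, padding="SAME")            # Convolution core
   y = tf.nn.bias_add(y, b)                                     # SDP: bias addition
   y = tf.nn.relu(y)                                            # SDP: activation function
   y = tf.nn.max_pool(y, ksize=2, strides=2, padding="VALID")   # PDP: pooling

   print(y.shape)  # (1, 28, 28, 128)
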
Cross-channel Data Processor ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Cross-channel Data Processor (CDP) is a specialized unit built to apply the local response normalization (LRN) function -- a special normalization function that operates on channel dimensions, as opposed to the spatial dimensions. The CDP maps onto the ``tf.nn.local_response_normalization`` function. Data Reshape Engine ^^^^^^^^^^^^^^^^^^^ The data reshape engine performs data format transformations (e.g., splitting or slicing, merging, contraction, reshape-transpose). Data in memory often needs to be reconfigured or reshaped in the process of performing inferencing on a convolutional network. For example, “slice” operations may be used to separate out different features or spatial regions of an image, and “reshape-transpose” operations (common in deconvolutional networks) create output data with larger dimensions than the input dataset. The data reshape engine maps onto TensorFlow operations such as ``tf.nn.conv2d_transpose``, ``tf.concat``, ``tf.slice``, and ``tf.transpose``. Bridge DMA ^^^^^^^^^^ The bridge DMA (BDMA) module provides a data copy engine to move data between the system DRAM and the dedicated high-performance memory interface, where present; this is an accelerated path to move data between these two otherwise non-connected memory systems. Configurability --------------- NVDLA has a wide array of hardware parameters that can be configured to balance area, power, and performance. The following is a short list of these options. * **Data types.** NVDLA natively supports a wide array of data types across its various functional units; a subset of these can be chosen to save area. Data types that can be selected include binary; int4; int8; int16; int32; fp16; fp32; and fp64. * **Input image memory formats.** NVDLA can support planar images, semi-planar images, or other packed memory formats. These different modes can be enabled or disabled to save area. * **Weight compression.** NVDLA has a mechanism to reduce memory bandwidth by sparsely storing convolution weights. This feature can be disabled to save area. * **Winograd convolution.** The Winograd algorithm is an optimization for certain dimensions of convolution. NVDLA can be built with or without support for it. * **Batched convolution.** Batching is a feature that saves memory bandwidth. NVDLA can be built with or without support for it. * **Convolution buffer size.** The convolution buffer is formed of a number of banks. It is possible to adjust the quantity of banks (from 2 to 32) and the size of each bank (from 4 KiB to 8 KiB). (By multiplying these together, it is possible to determine the total amount of convolution buffer memory that will be instantiated.) * **MAC array size.** The multiply-accumulate engine is formed in two dimensions. The width (the “C” dimension) can be adjusted from 8 to 64, and the depth (the “K” dimension) can be adjusted from 4 to 64. (The total number of multiply-accumulates that are created can be determined by multiplying these two together.) * **Second memory interface.** NVDLA can have support for a second memory interface for high-speed accesses, or it can be built with only one memory interface. * **Non-linear activation functions.** To save area, the lookup table that supports nonlinear activation functions (like sigmoid or tanh) can be removed. * **Activation engine size.** The number of activation outputs produced per cycle can be adjusted from 1 through 16. 
* **Bridge DMA engine.** The bridge DMA engine can be removed to save area. * **Data reshape engine.** The data reshape engine can be removed to save area. * **Pooling engine presence.** The pooling engine can be removed to save area. * **Pooling engine size.** The pooling engine can be adjusted to produce between 1 and 4 outputs per cycle. * **Local response normalization engine presence.** The local response normalization engine can be removed to save area. * **Local response normalization engine size.** The local response normalization engine can be adjusted to produce between 1 and 4 outputs per cycle. * **Memory interface bit width.** The memory interface bit width can be adjusted according to the width of the external memory interface to appropriately size internal buffers. * **Memory read latency tolerance.** Memory latency time is defined as the number of cycles from read request to read data return. The tolerance for this can be adjusted, which impacts the internal latency buffer size of each read DMA engine. Software Design =============== NVDLA has a full software ecosystem supporting it. Part of this ecosystem includes the on-device software stack, a part of the NVDLA open source release; additionally, NVIDIA will provide a full training infrastructure to build new models that incorporate Deep Learning, and to convert existing models to a form that is usable by NVDLA software. In general, the software associated with NVDLA is grouped into two parts: the *compilation tools* (model conversion), and the *runtime environment* (run-time software to load and execute networks on NVDLA). The general flow is shown in the figure below, and each part is described in the sections that follow. .. _fig_sw_flow: .. figure:: nvdla-primer-sw-flow.svg :alt: DL training software produces a model, which the compilation tool takes and turns into a loadable, which is used by the runtime environment. At runtime, the UMD submits work to the KMD via ioctl()s, and the KMD programs NVDLA with register writes. :scale: 50% :align: center Dataflow diagram inside of NVDLA system software. Compilation Tools: Model Creation and Compilation ------------------------------------------------- The compilation tools include the compiler and the parser. The compiler is responsible for creating a sequence of hardware layers that are optimized for a given NVDLA configuration; having an optimized network of hardware layers increases performance by reducing model size, load time, and run time. Compilation is a compartmentalized multi-step process that can be broken down into two basic components: parsing and compiling. The parser can be relatively simple; in its most basic incarnation, it can read a pre-trained Caffe model and create an “intermediate representation” of a network to pass to the next step of compilation. The compiler takes the parsed intermediate representation and the hardware configuration of an NVDLA implementation as its inputs, and generates a network of hardware layers. These steps are performed offline, and might be performed on the device that contains the NVDLA implementation. Knowing the specific hardware configuration of an NVDLA implementation is important; it enables the compiler to generate appropriate layers for the features that are available. For example, this might include selecting between different convolution operation modes (such as Winograd convolution, or basic convolution), or splitting convolution operations into multiple smaller mini-operations depending on the available convolution buffer size. 
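To give a rough sense of the splitting decision described above, the Python sketch below estimates how many weight passes a convolution layer would need for a given convolution buffer configuration. It is a deliberately simplified, hypothetical model (the function names, the fixed split between weight and feature storage, and the int8 assumption are all illustrative), not the actual NVDLA compiler logic, which also accounts for feature-data banking, convolution mode, and other constraints.

.. code-block:: python

   # Hypothetical sketch of a compiler-style split decision; not the NVDLA compiler.
   # Assumes the convolution buffer is sized as banks x bank size (see the
   # configurability options above) and that weights and features share it.

   def conv_buffer_bytes(banks: int, bank_kib: int) -> int:
       """Total convolution buffer capacity for a given hardware configuration."""
       return banks * bank_kib * 1024

   def weight_bytes(k_h: int, k_w: int, c_in: int, c_out: int, bytes_per_elem: int = 1) -> int:
       """Size of the full (uncompressed) weight array for one convolution layer."""
       return k_h * k_w * c_in * c_out * bytes_per_elem

   def weight_passes(layer_weight_bytes: int, buffer_bytes: int, weight_fraction: float = 0.5) -> int:
       """Number of weight chunks (passes); each extra pass re-reads the input features."""
       weight_budget = int(buffer_bytes * weight_fraction)
       return max(1, -(-layer_weight_bytes // weight_budget))  # ceiling division

   buf = conv_buffer_bytes(banks=32, bank_kib=8)         # 256 KiB total
   w = weight_bytes(k_h=3, k_w=3, c_in=256, c_out=512)   # ~1.1 MiB of int8 weights
   print(weight_passes(w, buf))                          # 9 passes with these assumptions

A layer that needs multiple passes multiplies its input-feature memory traffic accordingly, which is the effect discussed under memory bandwidth in the system-integration section below.
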
The compilation phase is also responsible for quantizing models to lower precision, such as 8-bit or 16-bit integer, or 16-bit floating point, and for allocating memory regions for weights. The same compiler tool can be used to generate a list of operations for multiple different NVDLA configurations. Runtime Environment: Model Inference on Device ---------------------------------------------- The runtime environment involves running a model on compatible NVDLA hardware. It is effectively divided into two layers: * **User Mode Driver (UMD).** The main interface with user-mode programs. After parsing the neural network, the compiler compiles the network layer by layer and converts it into a file format called an :term:`NVDLA Loadable`. The user-mode runtime driver loads this loadable and submits an inference job to the :ref:`kernel_mode_driver`. * **Kernel Mode Driver (KMD).** Consists of drivers and firmware that do the work of scheduling layer operations on NVDLA and programming the NVDLA registers to configure each functional block. The runtime execution starts with a stored representation of the network; this stored format is called an “NVDLA loadable” image. In the view of a loadable, each functional block in the NVDLA implementation is represented by a “layer” in software; each layer includes information about its dependencies, the tensors that it uses as inputs and outputs in memory, and the specific configuration of the block for that operation. Layers are linked together through a dependency graph, which the KMD uses to schedule each operation. The format of an NVDLA loadable is standardized across compiler implementations and UMD implementations. All implementations that comply with the NVDLA standard should be able to at least understand any NVDLA loadable image, even if the implementation lacks some features that are required to run inference using that loadable image. The UMD has a standard application programming interface (API) for processing loadable images, binding input and output tensors to memory locations, and running inference. This layer loads the network into memory in a defined set of data structures, and passes it to the KMD in an implementation-defined fashion. On Linux, for instance, this could be an ``ioctl()``, passing data from the user-mode driver to the kernel-mode driver; on a single-process system in which the KMD runs in the same environment as the UMD, this could be a simple function call. The KMD's main entry point receives an inference job in memory, selects from multiple available jobs for scheduling (if on a multi-process system), and submits it to the core engine scheduler. This core engine scheduler is responsible for handling interrupts from NVDLA, scheduling layers on each individual functional block, and updating any dependencies for that layer based upon the completion of a task from a previous layer. The scheduler uses information from the dependency graph to determine when subsequent layers are ready to be scheduled; this allows the compiler to decide scheduling of layers in an optimized way, and avoids performance differences arising from different implementations of the KMD. (A simplified sketch of this dependency-driven scheduling appears after the figure below.) .. _fig_portability_layer: .. figure:: nvdla-primer-portability-layer.svg :alt: Inside of the user application software and OS kernel, there is a portability layer, which wraps the DLA core code from the NVDLA GitHub. :align: center Portability layers in the NVDLA system. Both the UMD stack and the KMD stack exist as defined APIs, and are expected to be wrapped with a system portability layer. 
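The scheduling flow described above (send a layer's configuration, activate it, and advance dependent layers when the completion interrupt arrives) can be illustrated with a very reduced Python sketch. The class and field names here are hypothetical and exist only for illustration; they are not the NVDLA KMD data structures or APIs.

.. code-block:: python

   # Hypothetical illustration of dependency-driven layer scheduling.
   # These names and structures are not the actual NVDLA KMD interfaces.
   from dataclasses import dataclass, field

   @dataclass
   class Layer:
       name: str            # e.g. "conv1", "relu1"
       engine: str          # functional block that executes it, e.g. "conv", "sdp"
       deps: list = field(default_factory=list)  # layers that must complete first
       done: bool = False

   def ready(layer: Layer) -> bool:
       """A layer can be programmed once all of its dependencies have completed."""
       return not layer.done and all(d.done for d in layer.deps)

   def run_network(layers: list) -> None:
       """Command-execute-'interrupt' loop over the dependency graph."""
       while not all(l.done for l in layers):
           batch = [l for l in layers if ready(l)]
           if not batch:
               raise RuntimeError("dependency cycle or unsatisfiable dependency")
           for layer in batch:
               # In hardware this would be register writes plus an "activate" command;
               # completion would be reported asynchronously by an interrupt.
               print(f"program {layer.name} on {layer.engine} engine")
               layer.done = True  # stands in for the completion interrupt

   conv1 = Layer("conv1", "conv")
   relu1 = Layer("relu1", "sdp", deps=[conv1])
   pool1 = Layer("pool1", "pdp", deps=[relu1])
   run_network([conv1, relu1, pool1])
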
Maintaining core implementations within a portability layer is expected to require relatively few changes and to expedite any effort where it may be necessary to run an NVDLA software stack on multiple platforms; with the appropriate portability layers in place, the same core implementations should compile as readily on both Linux and FreeRTOS. Similarly, on “headed” implementations that have a microcontroller closely coupled to NVDLA, the existence of the portability layer makes it possible to run the same low-level software on the microcontroller as would run on the main CPU in a “headless” implementation that has no such companion processor. NVDLA System Integration ======================== NVDLA can be configured for a wide range of performance levels; choosing these parameters depends on the requirements of the Convolutional Neural Networks (CNNs) that will be executed. This section describes some of the factors that will influence the choice of these parameters, and some considerations of their impact on system area and performance. The time required to run each layer is the maximum of the time required for data input, the time required for data output, and the time required to perform the multiply-accumulate (MAC) operations. The time required to run the whole network is equal to the sum of the times for all the layers. Choosing the correct number of MAC units, the convolution buffer size, and the on-chip SRAM size for the desired performance is the most critical step in sizing. NVDLA has many more configuration parameters for additional performance tuning that require careful consideration; these have less impact on the total area, but they should be configured so that they do not become unnecessary bottlenecks. Tuning Questions ---------------- What math precision is required for the workloads expected for any given instantiation? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The bulk of the NVDLA area in larger configurations is used by convolution buffers and by MAC units, and so it stands to reason that these parameters are the most important in an initial performance / area tradeoff analysis. Deep Learning training is usually done at 32-bit floating point precision, but the resulting networks can often be reduced to 8-bit integers without significant loss of inference quality; in some cases, however, it may still be desirable to use 16-bit integers or floating point numbers. What are the number of MAC units, and the required memory bandwidth? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ After precision, the next two critical parameters for performance and area are the number of MAC units, and the required memory bandwidth. When configuring NVDLA, these should be carefully considered. Processing happens layer-by-layer, and so performance estimation is best done layer-by-layer as well. For any given layer, it is usually the case that either MAC throughput or memory bandwidth will be the bottleneck. The number of MAC units required is relatively easy to determine. For example, a convolutional layer has a known input and output resolution, and a known number of input and output features; the convolution kernel size is also known. Multiplying these together gives the total number of MAC operations needed to process the layer. The hardware can be defined to have a certain number of MAC units; dividing the number of operations required by the number of MAC units gives a lower bound for the number of clock cycles in which a layer can be processed. 
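As a concrete instance of this arithmetic, the sketch below computes the MAC-operation count of a single convolutional layer and the resulting cycle lower bound for several MAC array sizes. It is a back-of-the-envelope estimate under simplifying assumptions (perfect MAC utilization, no memory stalls), and the parameter names are illustrative only.

.. code-block:: python

   # Back-of-the-envelope MAC lower bound for one convolutional layer.
   # Ignores MAC-array utilization, stride/padding effects, and memory stalls.

   def conv_mac_ops(out_h, out_w, c_in, c_out, k_h, k_w):
       """Total multiply-accumulate operations for a direct convolution."""
       return out_h * out_w * c_in * c_out * k_h * k_w

   def min_cycles(mac_ops, num_macs):
       """Lower bound on cycles: operations divided by MAC units (rounded up)."""
       return -(-mac_ops // num_macs)

   # Example: a 3x3 convolution producing a 56x56x128 output from 64 input channels.
   ops = conv_mac_ops(out_h=56, out_w=56, c_in=64, c_out=128, k_h=3, k_w=3)
   for macs in (2048, 1024, 512, 256, 64):       # representative MAC array sizes
       print(macs, min_cycles(ops, macs))         # e.g. 2048 MACs -> 112,896 cycles
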
Calculating required memory bandwidth is less trivial. In the ideal case, it should only be necessary to read the input image once, write the output image once, and read the weights once; the minimum number of cycles is then the sum of those sizes divided by the number of samples that can be read or written per clock. However, if the convolution buffer is too small to hold the support region for the input and the set of weights, multiple passes are required. For example, if the convolution buffer can only hold a fourth of the weight data, then the calculation must be split into four steps, multiplying the input bandwidth (i.e., 10 MB of input memory traffic would multiply to 40 MB). Similarly, if the buffer cannot hold enough lines for a support region for the convolution, the convolution must also be broken up into horizontal strips. This effect is important to consider when choosing the convolution buffer size, and when sizing the memory interface. Is there a need for on-chip SRAM? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If external memory bandwidth is at a premium for power or performance reasons, then adding on-chip SRAM can help. Such SRAM can be thought of as a second-level cache; it can have higher bandwidth than the main memory, and that bandwidth is additive to the main memory bandwidth. An on-chip SRAM is less expensive to implement than a larger convolution buffer (which needs wide ports and has very stringent timing requirements), but it does not provide as large a multiplier in applications that are convolution-buffer-limited. (For instance, if a layer is bandwidth-limited, adding an SRAM that is large enough to hold the entire input image and that runs at twice the speed of the system’s DRAM can double the performance. However, if the layer is also limited by convolution buffer size, the same amount of memory could produce a much greater multiplier to system throughput.) The simplest way to consider this tradeoff is that adding convolution buffer size helps to reduce the bandwidth requirement, while adding an on-chip SRAM improves the total available bandwidth. Example Area and Performance with NVDLA --------------------------------------- The following table provides estimates for NVDLA configurations optimized for the popular ResNet-50 neural network. The area figures given are estimated synthesis area, and include all memories required; real area results will vary based on foundry and libraries. In this example, no on-chip SRAM is used; on-chip SRAM would be beneficial if the available SDRAM bandwidth is low. The open-source release of NVDLA has a performance estimator tool available to explore the space of NVDLA designs and their impact on performance. Power and performance in the following table are shown for a 1 GHz frequency. Power and performance for a given configuration can be varied through adjustment of voltage and frequency.

+--------+--------------+------------------+--------------+---------------+---------------+----------------+
| # MACs |              |                  | Silicon Cell | Silicon Cell  | Int8 ResNet-50| Power Estimate |
|        | Conv. buffer | SDRAM            | Area         | Area          | (frames/sec)  | Peak/Average   |
|        | size (KB)    | bandwidth (GB/s) | (mm^2, 28nm) | (mm^2, 16nm)  |               | (mW, 16nm)     |
+========+==============+==================+==============+===============+===============+================+
| 2048   | 512          | 20               | 5.5          | 3.3           | 269           | 766 / 291      |
+--------+--------------+------------------+--------------+---------------+---------------+----------------+
| 1024   | 256          | 15               | 3.0          | 1.8           | 153           | 375 / 143      |
+--------+--------------+------------------+--------------+---------------+---------------+----------------+
| 512    | 256          | 10               | 2.3          | 1.4           | 93            | 210 / 80       |
+--------+--------------+------------------+--------------+---------------+---------------+----------------+
| 256    | 256          | 5                | 1.7          | 1.0           | 46            | 135 / 48       |
+--------+--------------+------------------+--------------+---------------+---------------+----------------+
| 128    | 256          | 2                | 1.4          | 0.84          | 20            | 82 / 31        |
+--------+--------------+------------------+--------------+---------------+---------------+----------------+
| 64     | 128          | 1                | 0.91         | 0.55          | 7.3           | 55 / 21        |
+--------+--------------+------------------+--------------+---------------+---------------+----------------+
| 32     | 128          | 0.5              | 0.85         | 0.51          | 3.6           | 45 / 17        |
+--------+--------------+------------------+--------------+---------------+---------------+----------------+

Sample Platforms ---------------- Sample platforms are provided which allow users to observe, evaluate, and test NVDLA in a minimal SoC environment. A minimum SoC system configuration consists of a CPU, an NVDLA instance, an interconnect, and memories. These platforms can be used for software development, or as a starting point for integrating NVDLA into an industrial-strength SoC. Simulation ^^^^^^^^^^ The NVDLA open source release includes a simulation platform based on GreenSocs QBox. In this platform, a QEMU CPU model (x86 or ARMv8) is combined with the NVDLA SystemC model, providing a register-accurate system on which software can be quickly developed and debugged. The Linux kernel-mode driver and a user-mode test utility are provided to run on this simulation platform. FPGA ^^^^ This sample platform maps the NVDLA Verilog model onto an FPGA; it provides a synthesizable example of instantiating NVDLA in a real design. In this platform, the NVDLA SystemC model is not used; software register reads and writes execute directly against the real RTL. This allows for limited cycle-counting performance evaluation, and also allows for even faster testing of software against larger, more complex networks. The FPGA model is intended for validation only; no effort has been made to optimize cycle time, design size, or power for the FPGA platform, and the performance of the FPGA model is not directly comparable against that of other FPGA-based Deep Learning accelerators. The FPGA system model uses the Amazon EC2 “F1” environment, which is a publicly available, standardized FPGA system that can be leased by the hour. No up-front purchase of specialized hardware or software is necessary to use this model; the synthesis software is available for only the cost of compute time on the Amazon EC2 environment, and access to the hardware requires no long-term commitment. Because the FPGA platform is Xilinx-based, migration to other Virtex-family devices should be relatively straightforward. Models ------ NVDLA IP-core models are based on open industry standards. Their straightforward design and use of basic constructs are expected to integrate easily into typical SoC design flows. 
Verilog model ^^^^^^^^^^^^^ The Verilog model provides a synthesis and simulation model in RTL form. It has four functional interfaces: a slave host interface, an interrupt line, and two master interfaces for internal and external memory access. The host and memory interfaces are very simple, but require external bus adapters to connect to an existing SoC design; for convenience, sample adapters for AXI4 and TileLink are included as part of the NVDLA open source release. The NVDLA open source release contains example synthesis scripts. To facilitate physical design on more complex systems or larger instantiations of NVDLA, the design is split into partitions that can each be handled independently in the SoC backend flow. The interfaces between the partitions can be retimed as needed to meet routing requirements. The NVDLA core operates in a single clock domain; bus adapters allow for clock domain crossing from the internal NVDLA clock to the bus clocks. Similarly, NVDLA operates in a single power domain; the design applies both fine- and coarse-grain power gating. SRAMs in the implementation are modeled by behavioral models and must be replaced by compiled RAMs in a full SoC design. The NVDLA design requires implementations of both single-ported and dual-ported (one read port plus one write port) SRAMs. Simulation model and verification suite ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ NVDLA includes a TLM2 SystemC simulation model for software development, system integration, and testing. This model enables much faster simulation than would otherwise be available by running the RTL in conjunction with signal-stimulus models. This SystemC model is intended to be used in full-SoC simulation environments, such as Synopsys VDK or the provided GreenSocs QBox platform. The included model is parameterizable on the same axes as the RTL model, for direct comparison and simulation. The simulation model can also be used with the NVDLA testbench and verification suite. The lightweight, trace-player-based testbench is suitable for simple synthesis and build health verification (this will be available with the initial NVDLA release). A full verification environment with extensive unit-by-unit testing will become available in a subsequent release. The verification suite can be used to provide design assurance before tape-out, including verifying changes for compiled RAMs, clock gating, and scan-chain insertion. This environment will be suitable for making more substantial changes (e.g., verifying new NVDLA configurations or modifications made to an existing NVDLA design). Software -------- The initial NVDLA open-source release includes software for a “headless” implementation, compatible with Linux. Both a kernel-mode driver and a user-mode test utility are provided in source form, and can run on top of otherwise-unmodified Linux systems. Appendix: Deep Learning references ================================== This document assumes some amount of familiarity with general concepts pertaining to Deep Learning and Convolutional Neural Networks. The following links are provided as a means to begin or further an individual's investigation into these topics, as needed. 
* `NVIDIA Webinar Series: Deep Learning Demystified `_ * A Beginner's Guide to Understanding Convolutional Neural Networks * `Part 1 `_ * `Part 2 `_ * `Inference: The Next Step in GPU-Accelerated Deep Learning `_ * `NVIDIA Whitepaper: GPU-Based Deep Learning Inference: A performance and Power Analysis `_ * `Fundamentals of Deep Learning: What’s the Difference between Deep Learning Training and Inference? `_ * `Inception (GoogLeNet): Going Deeper with Convolutions `_ * `Microsoft ResNet: Deep Residual Learning for Image Recognition `_ * `AlexNet: ImageNet Classification with Deep Convolutional Neural Networks `_ * `VGG Net: Very Deep Convolutional Networks for Large-Scale Image Recognition `_