Runtime environment

[Figure: Runtime environment (../_images/runtime_environment.png)]

The runtime environment includes software to run a compiled neural network on compatible NVDLA hardware. It consists of two parts:

  • User Mode Driver - This is the main interface to the application. As detailed in the Compilation tools, after parsing and compiling the neural network layer by layer, the compiled output is stored in a file format called an NVDLA Loadable. The user-mode runtime driver loads this loadable and submits inference jobs to the Kernel Mode Driver.
  • Kernel Mode Driver - Consists of a kernel-mode driver and an engine scheduler that schedule the compiled network on NVDLA and program the NVDLA registers to configure each functional block.

The runtime environment uses the stored representation of the network saved as an NVDLA Loadable image. From the point of view of the NVDLA Loadable, each compiled “layer” in software is loadable on a functional block in the NVDLA implementation. Each layer includes information about its dependencies on other layers, the buffers that it uses for inputs and outputs in memory, and the specific configuration of each functional block used for its execution. Layers are linked together through a dependency graph, which the engine scheduler uses for scheduling. The format of an NVDLA Loadable is standardized across compiler implementations and UMD implementations. All implementations that comply with the NVDLA standard should be able to at least interpret any NVDLA Loadable image, even if the implementation lacks some features required to run inferencing with that loadable image.
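
Conceptually, each such layer entry carries a block assignment, dependency links, buffer references, and the configuration for the functional block that runs it. The struct below is only an illustrative sketch of that idea; the type and field names are hypothetical and do not reflect the actual loadable format, which is defined by the ILoadable interface.

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch of what one compiled "layer" entry conceptually carries.
    // The real NVDLA Loadable layout is defined by nvdla::ILoadable, not by this struct.
    struct LayerEntry {
        uint32_t functional_block;        // which NVDLA block executes this layer (CONV, SDP, PDP, ...)
        std::vector<uint32_t> depends_on; // layers that must complete before this one
        uint32_t input_buffer_id;         // index into the loadable's memory/address list
        uint32_t output_buffer_id;        // index into the loadable's memory/address list
        std::vector<uint32_t> block_cfg;  // register-level configuration for the functional block
    };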

Both the User Mode Driver stack and the Kernel Mode Driver stack exist as defined APIs, and are expected to be wrapped with a system portability layer. Keeping the core implementations behind such a portability layer is expected to require relatively few changes when porting, which expedites any effort needed to run the NVDLA software stack on multiple platforms. With the appropriate portability layers in place, the same core implementations should compile as readily on Linux as on FreeRTOS. Similarly, on “headed” implementations that have a microcontroller closely coupled to NVDLA, the existence of the portability layer makes it possible to run the same kernel mode driver on that microcontroller as would have run on the main CPU in a “headless” implementation that had no such companion microcontroller.

User Mode Driver

[Figure: User Mode Driver (../_images/umd.png)]

The UMD provides a standard Application Programming Interface (API) for processing loadable images, binding input and output tensors to memory locations, and submitting inference jobs to the KMD. This layer loads the network into memory in a defined set of data structures, and passes it to the KMD in an implementation-defined fashion. On Linux, for instance, this could be an ioctl() passing data from the user-mode driver to the kernel-mode driver; on a single-process system in which the KMD runs in the same environment as the UMD, this could be a simple function call. Low-level functions are implemented in the UMD portability layer (described below).
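
As a rough sketch of that hand-off, a UMD-side submit helper might look like the following; the request code, the NVDLA_SINGLE_PROCESS switch, and dla_execute_task_direct() are hypothetical placeholders, not part of the NVDLA API.

    #include <sys/ioctl.h>

    struct NvDlaTask;                        // DLA task structure (see Portability layer)

    // Hypothetical values and entry points: the real request code, device fd handling
    // and direct KMD entry point are implementation defined.
    #define NVDLA_SUBMIT_IOCTL 0x4e440001UL
    int dla_execute_task_direct(struct NvDlaTask *task);

    int submit_to_kmd(int dla_fd, struct NvDlaTask *task)
    {
    #ifdef NVDLA_SINGLE_PROCESS
        // KMD runs in the same environment as the UMD: a plain function call suffices.
        (void)dla_fd;
        return dla_execute_task_direct(task);
    #else
        // Split user/kernel build (e.g. Linux): cross the boundary with an ioctl()
        // on the DLA device file descriptor.
        return ioctl(dla_fd, NVDLA_SUBMIT_IOCTL, task);
    #endif
    }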

Application Programming Interface

NVDLA namespace

type NvError

Enum for error codes

Runtime Interface

This is the interface for the runtime library. It implements functions to process a loadable buffer passed from the application after reading it from a file, allocate memory for tensors and intermediate buffers, prepare synchronization points, and finally submit the inference job to the KMD. An inference job submitted to the KMD is referred to as a DLA task.

class nvdla::IRuntime

Runtime interface

nvdla::IRuntime *nvdla::createRuntime()

Create runtime instance

Returns:IRuntime object

Device information

int nvdla::IRuntime::getMaxDevices()

Get the maximum number of devices supported by the HW configuration. The runtime driver supports submitting inference jobs to multiple DLA devices, and the user application can select which device to use. A task cannot be split across devices; each task is submitted to exactly one device.

Returns:Maximum number of devices supported
int nvdla::IRuntime::getNumDevices()

Get the number of devices currently available, out of the maximum number supported by the HW configuration.

Returns:Number of available devices
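
A minimal usage sketch of these two calls (the header path is an assumption; error handling is elided):

    #include "nvdla/IRuntime.h"   // header path is an assumption; use your tree's UMD headers

    void print_device_info()
    {
        nvdla::IRuntime *runtime = nvdla::createRuntime();

        int maxDevices = runtime->getMaxDevices();   // devices the HW configuration can support
        int numDevices = runtime->getNumDevices();   // devices actually available right now

        NvDlaDebugPrintf("DLA devices: %d available of %d supported\n",
                         numDevices, maxDevices);
    }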

Loading NVDLA loadable image

NvError nvdla::IRuntime::load(const NvU8 *buf)

Parse loadable from buffer and update ILoadable with information required to create task

Parameters:buf – Loadable image buffer
Returns:NvError
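
For example, an application might read the loadable image from a file and pass the raw buffer to load(). The sketch below assumes NvSuccess is the non-error NvError value and that the usual UMD headers are available:

    #include <fstream>
    #include <vector>
    #include "nvdla/IRuntime.h"   // header path is an assumption

    bool load_network(nvdla::IRuntime *runtime, const char *path)
    {
        // Read the NVDLA Loadable image into a contiguous buffer.
        std::ifstream file(path, std::ios::binary | std::ios::ate);
        if (!file)
            return false;
        std::vector<NvU8> image(static_cast<size_t>(file.tellg()));
        file.seekg(0);
        file.read(reinterpret_cast<char *>(image.data()),
                  static_cast<std::streamsize>(image.size()));

        // Parse the loadable and prepare the runtime for task creation.
        NvError e = runtime->load(image.data());
        return e == NvSuccess;   // NvSuccess as the "no error" code is an assumption
    }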

Input tensors

NvError nvdla::IRuntime::getNumInputTensors(int *input_tensors)

Get number of network’s input tensors from loadable

Parameters:input_tensors – Pointer to update number of input tensors value
Returns:NvError
NvError nvdla::IRuntime::getInputTensorDesc(int id, nvdla::ILoadable::TensorDescListEntry *tensors)

Get network’s input tensor descriptor

Parameters:
  • id – Tensor ID
  • tensors – Tensor descriptor
Returns:

NvError

NvError nvdla::IRuntime::setInputTensorDesc(int id, const nvdla::ILoadable::TensorDescListEntry *tensors)

Set network’s input tensor descriptor

Parameters:
  • id – Tensor ID
  • tensors – Tensor descriptor
Returns:

NvError

Output tensors

NvError nvdla::IRuntime::getNumOutputTensors(int *output_tensors)

Get number of network’s output tensors from loadable

Parameters:output_tensors – Pointer to update number of output tensors value
Returns:NvError
NvError nvdla::IRuntime::getOutputTensorDesc(int id, nvdla::ILoadable::TensorDescListEntry *tensors)

Get network’s output tensor descriptor

Parameters:
  • id – Tensor ID
  • tensors – Tensor descriptor
Returns:

NvError

NvError nvdla::IRuntime::setOutputTensorDesc(int id, const nvdla::ILoadable::TensorDescListEntry *tensors)

Set network’s output tensor descriptor

Parameters:
  • id – Tensor ID
  • tensors – Tensor descriptor
Returns:

NvError
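
The input and output tensor APIs are symmetric. Below is a sketch of enumerating and (optionally) updating the input tensor descriptors, assuming NvSuccess is the non-error NvError value; the output side is identical with the Output variants.

    // Assumes the NVDLA UMD headers are already included.
    // Enumerate input tensor descriptors; error handling elided.
    void list_input_tensors(nvdla::IRuntime *runtime)
    {
        int numInputs = 0;
        if (runtime->getNumInputTensors(&numInputs) != NvSuccess)
            return;

        for (int i = 0; i < numInputs; i++) {
            nvdla::ILoadable::TensorDescListEntry desc;
            if (runtime->getInputTensorDesc(i, &desc) == NvSuccess) {
                // Inspect or adjust the descriptor here, then (optionally) write it back.
                runtime->setInputTensorDesc(i, &desc);
            }
        }
    }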

Binding tensors

NvError nvdla::IRuntime::bindInputTensor(int id, NvDlaMemHandle hMem)

Bind network’s input tensor to memory handle

Parameters:
  • id – Tensor ID
  • hMem – DLA memory handle returned by NvDlaGetMem()
Returns:

NvError

NvError nvdla::IRuntime::bindOutputTensor(int id, NvDlaMemHandle hMem)

Bind network’s output tensor to memory handle

Parameters:
  • id – Tensor ID
  • hMem – DLA memory handle returned by NvDlaGetMem()
Returns:

NvError
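
Binding ties a tensor ID to DLA-accessible memory obtained from the portability layer. A sketch for input tensor 0 follows; the heap selector NVDLA_HEAP_SYSTEM, the flags value 0, and NvSuccess are assumptions, since heap and flag values are implementation defined.

    // Assumes the NVDLA UMD and portability-layer headers are already included.
    bool bind_input(nvdla::IRuntime *runtime, NvDlaHandle hDla, NvU32 size)
    {
        NvDlaMemHandle hMem;
        void *vaddr = NULL;
        if (NvDlaGetMem(hDla, &hMem, &vaddr, size, NVDLA_HEAP_SYSTEM, 0) != NvSuccess)
            return false;

        // Fill the buffer with input data through vaddr (e.g. NvDlaMemWrite or memcpy),
        // then hand the handle to the runtime.
        return runtime->bindInputTensor(0, hMem) == NvSuccess;
    }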

Running inference

NvError nvdla::IRuntime::submit()

Submit a task for inference; this is a blocking call

Returns:NvError
NvError nvdla::IRuntime::submitNonBlocking(std::vector<ISync *> *outputSyncs)

Submit non-blocking task for inference

Parameters:outputSyncs – List of output ISync objects
Returns:NvError
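
Once the loadable is loaded and the tensors are bound, inference is a single call. Below is a sketch of both paths, assuming NvSuccess is the non-error NvError value and using an arbitrary 1000 ms timeout.

    #include <vector>
    // Assumes the NVDLA UMD headers are already included.

    // Blocking path: submit() returns when the task has finished.
    NvError run_blocking(nvdla::IRuntime *runtime)
    {
        return runtime->submit();
    }

    // Non-blocking path: submit, then wait on the returned output sync objects.
    NvError run_async(nvdla::IRuntime *runtime)
    {
        std::vector<nvdla::ISync *> outputSyncs;
        NvError e = runtime->submitNonBlocking(&outputSyncs);
        if (e != NvSuccess)
            return e;

        for (nvdla::ISync *sync : outputSyncs) {
            e = sync->wait(1000 /* ms */);   // block until this output is signaled
            if (e != NvSuccess)
                return e;
        }
        return e;
    }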

Sync Interface

The Sync interface is used to synchronize between inference tasks. A software implementation can add synchronization primitives of its choice behind the ISync wrapper.

class nvdla::ISync

Sync interface

NvError wait(NvU32 timeout)

Blocks the caller until the ISync object has signaled, or the timeout expires

Parameters:timeout – The timeout value, in milliseconds
Returns:NvError
NvError signal()

Requests the ISync object to be signaled

Returns:NvError
NvError nvdla::ISync::setWaitValue(NvU32 val)

Set the comparison value for determining synchronization

Parameters:val – Comparison value to set
Returns:NvError
NvError nvdla::ISync::getWaitValue(NvU32 *val) const

Get the comparison value for determining synchronization

Parameters:val – Pointer to read wait value
Returns:NvError
NvError nvdla::ISync::getValue(NvU32 *val) const

Get the current value of the semaphore

Parameters:val – Pointer to read current value
Returns:NvError
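
A small sketch of waiting for completion through an ISync object; the wait value of 1 and the 500 ms timeout are arbitrary examples, and NvSuccess is assumed to be the non-error NvError value.

    // Assumes the NVDLA UMD headers are already included.
    NvError wait_for_output(nvdla::ISync *sync)
    {
        // Tell the sync object which semaphore value means "done" (example value).
        NvError e = sync->setWaitValue(1);
        if (e != NvSuccess)
            return e;

        // Block for up to 500 ms waiting for the signal.
        e = sync->wait(500);

        // The current semaphore value can also be inspected directly.
        NvU32 current = 0;
        sync->getValue(&current);
        NvDlaDebugPrintf("sync value = %u\n", (unsigned)current);
        return e;
    }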

Loadable Interface

The loadable contains the compiled network and model data converted to DLA format. This interface implements functions to read data from the loaded image.

class nvdla::ILoadable

Loadable interface

class nvdla::ILoadable::TensorDescListEntry

Portability layer

NvDlaEngineSelect

Implementation-defined enum to select a device instance when there are multiple DLA devices

NvDlaHandle

Implementation-defined handle used by the runtime driver to communicate with the portability layer. NvDlaOpen() allocates this handle and returns it to the runtime driver.

NvDlaMemHandle

Implementation-defined memory handle used for memory operations implemented by the portability layer. NvDlaGetMem() allocates this handle and returns it to the runtime driver for future operations on the allocated memory buffer.

NvDlaTask

DLA task structure. The runtime driver populates it using information from the loadable, and the portability layer uses it to submit an inference task to the KMD in an implementation-defined manner.

NvDlaFence

Implementation-defined sync object descriptor. It is populated by the runtime driver in the ISync implementation and is used by the portability layer to send sync object information to the KMD.

NvDlaTaskStatus

Task status structure used to report the task status to runtime.

NvDlaHeap

Implementation-defined enum for the memory heaps supported by the system.

NvError

Enum for error codes

NvError NvDlaOpen(NvDlaEngineSelect DlaSelect, NvDlaHandle *phDla)

This API should initialize the portability layer, which includes opening the DLA device instance, allocating required structures, and initializing the session. How the integrator implements the portability layer is implementation defined. It should allocate an NvDlaHandle and update phDla with it. This handle will be used for any future requests to the portability layer for this session, such as memory allocation, memory mapping, or inference job submission.

Parameters:
  • DlaSelect – Engine to initialize
  • phDla – DLA handle updated if initialization is successful
Returns:

NvError
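
As one possible shape for a Linux user-space portability layer, NvDlaOpen() might open a device node and stash the file descriptor in the handle. Everything below is a hypothetical sketch: the device node name, the handle layout, and the NvError_FileOperationFailed/NvSuccess codes are assumptions, not required by NVDLA.

    #include <fcntl.h>
    #include <cstdio>
    // Assumes the NVDLA portability-layer headers are already included, and that
    // NvDlaHandle is typedef'd to a pointer to the record below.

    struct NvDlaHandleRec {
        int fd;                       // file descriptor of the DLA device node
        NvDlaEngineSelect engine;     // which DLA instance this session talks to
    };

    NvError NvDlaOpen(NvDlaEngineSelect DlaSelect, NvDlaHandle *phDla)
    {
        char node[32];
        std::snprintf(node, sizeof(node), "/dev/nvdla%d",
                      static_cast<int>(DlaSelect));   // node name is an assumption

        int fd = open(node, O_RDWR);
        if (fd < 0)
            return NvError_FileOperationFailed;       // error code name is an assumption

        NvDlaHandleRec *h = new NvDlaHandleRec();
        h->fd = fd;
        h->engine = DlaSelect;
        *phDla = h;                   // NvDlaClose() would close(fd) and delete the record
        return NvSuccess;             // success code name is an assumption
    }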

void NvDlaClose(NvDlaHandle hDla)

Close DLA device instance

Parameters:hDla – DLA handle returned by NvDlaOpen()
NvError NvDlaSubmit(NvDlaHandle hDla, NvDlaTask *tasks, NvU32 num_tasks)

Submit inference task to KMD

Parameters:
  • hDla – DLA handle returned by NvDlaOpen()
  • tasks – List of tasks to submit for inferencing
  • num_tasks – Number of tasks to submit
Returns:

NvError

NvError NvDlaGetMem(NvDlaHandle hDla, NvDlaMemHandle *handle, void **pData, NvU32 size, NvDlaHeap heap, NvU32 flags)

Allocate, pin, and map DLA-engine-accessible memory. For example, on systems where the DLA sits behind an IOMMU, this call should ensure that IOMMU mappings are created for this memory. On Linux, the internal implementation can use readily available frameworks such as ION for this.

Parameters:
  • hDla – DLA handle returned by NvDlaOpen()
  • handle ([out]) – Memory handle updated by this function
  • pData – If the allocation and mapping is successful, provides a virtual address through which the memory buffer can be accessed.
  • size – Size of buffer to allocate
  • heap – Implementation defined memory heap selection
  • flags – Implementation defined
Returns:

NvError

NvError NvDlaFreeMem(NvDlaHandle hDla, NvDlaMemHandle handle, void *pData, NvU32 size)

Free DMA memory allocated using NvDlaGetMem()

Parameters:
  • hDla – DLA handle returned by NvDlaOpen()
  • handle – Memory handle returned by NvDlaGetMem()
  • pData – Virtual address returned by NvDlaGetMem()
  • size – Size of the buffer allocated
Returns:

NvError
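
The right implementation is highly platform specific (ION, dma-buf, a bare-metal carveout, and so on). Purely as an illustration, a virtual-platform-style stub in which the "device" shares the process address space could look like the following; a real system must instead return memory that is DMA-able by NVDLA and, where applicable, IOMMU mapped. The handle layout and the NvSuccess/NvError_InsufficientMemory codes are assumptions.

    #include <cstdlib>
    // Assumes the NVDLA portability-layer headers are already included, and that
    // NvDlaMemHandle is typedef'd to a pointer to the record below.

    struct NvDlaMemRec { void *va; NvU32 size; };

    NvError NvDlaGetMem(NvDlaHandle hDla, NvDlaMemHandle *handle, void **pData,
                        NvU32 size, NvDlaHeap heap, NvU32 flags)
    {
        (void)hDla; (void)heap; (void)flags;

        NvDlaMemRec *m = new NvDlaMemRec();
        m->va = std::calloc(1, size);      // zero-initialized plain heap buffer
        m->size = size;
        if (m->va == NULL) {
            delete m;
            return NvError_InsufficientMemory;
        }
        *handle = m;
        *pData = m->va;
        return NvSuccess;
    }

    NvError NvDlaFreeMem(NvDlaHandle hDla, NvDlaMemHandle handle, void *pData, NvU32 size)
    {
        (void)hDla; (void)pData; (void)size;
        NvDlaMemRec *m = handle;
        std::free(m->va);
        delete m;
        return NvSuccess;
    }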

NvError NvDlaMemMap(NvDlaMemHandle hMem, NvU32 Offset, NvU32 Size, NvU32 Flags, void **pVirtAddr)

Attempts to map a memory buffer into the process’s virtual address space.

Parameters:
  • hMem – A memory handle returned from NvDlaGetMem()
  • Offset – Byte offset within the memory buffer to start the map at.
  • Size – Size in bytes of mapping requested. Must be greater than 0.
  • Flags – Implementation defined
  • pVirtAddr – If the mapping is successful, provides a virtual address through which the memory buffer can be accessed.
Returns:

NvError

void NvDlaMemUnmap(NvDlaMemHandle hMem, void *pVirtAddr, NvU32 length)

Unmaps a memory buffer from the process’s virtual address space.

Parameters:
  • hMem – A memory handle returned from NvDlaGetMem()
  • pVirtAddr – The virtual address returned by a previous call to NvDlaMemMap() with hMem.
  • length – The size in bytes of the mapped region. Must be the same as the size value originally passed to NvDlaMemMap().
void NvDlaMemRead(NvDlaMemHandle hMem, NvU32 Offset, void *pDst, NvU32 Size)

Reads a block of data from a buffer.

Parameters:
  • hMem – A memory handle returned from NvDlaGetMem()
  • Offset – Byte offset relative to the base of hMem.
  • pDst – The buffer where the data should be placed.
  • Size – The number of bytes of data to be read.
void NvDlaMemWrite(NvDlaMemHandle hMem, NvU32 Offset, const void *pSrc, NvU32 Size)

Writes a block of data to a buffer

Parameters:
  • hMem – A memory handle returned from NvDlaGetMem()
  • Offset – Byte offset relative to the base of hMem.
  • pSrc – The buffer to obtain the data from.
  • Size – The number of bytes of data to be written.
NvU64 NvDlaMemGetSize(NvDlaMemHandle hMem)

Get the size of the buffer associated with a memory handle

Parameters:hMem – A memory handle returned from NvDlaGetMem()
Returns:

Size in bytes of memory allocated for this handle or 0 in case of error.

void NvDlaDebugPrintf(const char *format, ...)

Outputs a message to the debugging console, if present.

Parameters:
  • format – A pointer to the format string
void *NvDlaAlloc(size_t size)

Dynamically allocates memory. Alignment, if desired, must be done by the caller.

Parameters:
  • size – The size of the memory to allocate
void NvDlaFree(void *ptr)

Frees a dynamic memory allocation. Freeing a null pointer is allowed.

Parameters:
  • ptr – A pointer to the memory to free, which should be from NvDlaAlloc().
void NvDlaSleepMS(NvU32 msec)

Unschedule calling thread for at least the given number of milliseconds. Other threads may run during the sleep time.

Parameters:
  • msec – The number of milliseconds to sleep.
NvU32 NvDlaGetTimeMS(void)

Return the system time in milliseconds. The returned values are guaranteed to be monotonically increasing, but may wrap back to zero (after about 50 days of runtime). In some systems, this is the number of milliseconds since power-on, or may actually be an accurate date.

Returns:System time in milliseconds
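
On a POSIX-style platform these two functions can be implemented directly on top of the standard C++ library; a sketch (NvU32 is assumed to be a 32-bit unsigned type from the NVDLA headers):

    #include <chrono>
    #include <thread>

    void NvDlaSleepMS(NvU32 msec)
    {
        // May sleep longer than requested; other threads keep running.
        std::this_thread::sleep_for(std::chrono::milliseconds(msec));
    }

    NvU32 NvDlaGetTimeMS(void)
    {
        // Monotonic millisecond counter; wraps when truncated to 32 bits
        // (roughly every 49.7 days).
        auto now = std::chrono::steady_clock::now().time_since_epoch();
        return static_cast<NvU32>(
            std::chrono::duration_cast<std::chrono::milliseconds>(now).count());
    }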

Kernel Mode Driver

[Figure: Kernel Mode Driver (../_images/kmd.png)]

The KMD main entry point receives an inference job in memory, selects from multiple available jobs for execution (if on a multi-process system), and submits it to the core engine scheduler. This core engine scheduler is responsible for handling interrupts from NVDLA, scheduling layers on each individual functional block, and updating any dependencies based upon the completion of the layer. The scheduler uses information from the dependency graph to determine when subsequent layers are ready to be scheduled; this allows the compiler to decide scheduling of layers in an optimized way, and avoids performance differences from different implementations of the KMD.
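
Conceptually, this dependency tracking amounts to keeping a count of unfinished producers per layer and enqueuing a layer once that count reaches zero. The sketch below illustrates that idea only; it is not the KMD's actual data structure.

    #include <queue>
    #include <vector>

    // Simplified model of dependency-driven layer scheduling (illustrative only).
    struct Layer {
        int functional_block;            // which NVDLA block runs this layer
        int pending_deps;                // producers that have not completed yet
        std::vector<int> consumers;      // layers that depend on this one
    };

    void on_layer_complete(std::vector<Layer> &layers, int finished,
                           std::queue<int> &ready)
    {
        // A layer finished: release its consumers and enqueue any that became ready.
        for (int c : layers[finished].consumers) {
            if (--layers[c].pending_deps == 0)
                ready.push(c);           // can now be programmed on its functional block
        }
    }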

[Figure: KMD interface (../_images/kmd_interface.png)]

Interface

dla_task_descriptor

Task descriptor structure. This structure includes all the information required to execute a network, such as the number of layers, the dependency graph address, etc.

int dla_execute_task(struct dla_task_descriptor *task);

The task is submitted to the engine scheduler for execution. The engine scheduler initiates each functional block for which a layer is present. After this, the driver should process all events and wait until all layers are completed.

Parameters:
  • task – Task descriptor
Returns:

0 if success otherwise error code

int dla_engine_init(void *priv_data)

Initialize engine scheduler. This function should be called when driver is probed at boot time.

Parameters:
  • priv_data – Any data that should be passed back when the engine scheduler calls functions implemented in the portability layer
Returns:

0 if success otherwise error code

int dla_process_events(void);

Process events recorded in the interrupt handler. This function must be called from thread/process context, not from interrupt context. It reads the events recorded by the interrupt handler, updates dependencies using the event information, and programs the next layers in the network. The driver should call this function immediately after handling the interrupt, and its execution must be atomic, protected by a lock.

Returns:0 if success otherwise error code
int dla_isr_handler(void);

The interrupt handler records events by reading the interrupt status registers. The driver should call this function from the OS interrupt handler, and its execution must be atomic, protected by a lock.

Returns:0 if success otherwise error code
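
One way an OS driver might wire these two entry points together: the interrupt path records events via dla_isr_handler() and wakes a worker that calls dla_process_events() from thread context, both under a lock as required above. The sketch uses generic C++ threading as a stand-in for the platform's real interrupt plumbing (on Linux this would typically be a threaded IRQ handler with ISR-safe locking).

    #include <condition_variable>
    #include <mutex>

    // KMD interface functions documented above.
    int dla_isr_handler(void);
    int dla_process_events(void);

    static std::mutex g_engine_lock;          // makes both calls atomic, as required
    static std::condition_variable g_event_cv;
    static bool g_event_pending = false;

    // Called from the platform's interrupt path (kept short: just record events).
    // A real driver would use the platform's ISR-safe locking here.
    void nvdla_irq_top_half()
    {
        {
            std::lock_guard<std::mutex> lock(g_engine_lock);
            dla_isr_handler();                // read interrupt status, record events
            g_event_pending = true;
        }
        g_event_cv.notify_one();              // wake the thread-context bottom half
    }

    // Runs in thread/process context, never in interrupt context.
    void nvdla_event_thread()
    {
        for (;;) {
            std::unique_lock<std::mutex> lock(g_engine_lock);
            g_event_cv.wait(lock, [] { return g_event_pending; });
            g_event_pending = false;
            dla_process_events();             // update dependencies, program next layers
        }
    }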

Portability layer

The driver should implement the functions below, which are called from the engine scheduler.

Register read/write

void dla_reg_write(uint32_t addr, uint32_t reg)

Register write. The implementation of this function depends on how the DLA is accessible from the CPU. It should take care of adding the base address to addr.

Parameters:
  • addr – Register offset starting from 0 as base address
  • reg – Value to write
uint32_t dla_reg_read(uint32_t addr)

Register read. The implementation of this function depends on how the DLA is accessible from the CPU. It should take care of adding the base address to addr.

Parameters:
  • addr – Register offset starting from 0 as base address
Returns:

Register value
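
For a memory-mapped NVDLA instance these typically reduce to volatile accesses relative to a mapped base address; a sketch follows (how g_dla_regs gets populated, e.g. ioremap() on Linux or a fixed address on bare metal, is platform specific and only hinted at here).

    #include <cstdint>

    // Base virtual address of the NVDLA register window, set up by the platform
    // (e.g. ioremap() in a Linux kernel driver, or a fixed address on bare metal).
    static volatile uint32_t *g_dla_regs;

    void dla_reg_write(uint32_t addr, uint32_t reg)
    {
        // addr is a byte offset from the NVDLA base; the base is added here.
        g_dla_regs[addr / sizeof(uint32_t)] = reg;
    }

    uint32_t dla_reg_read(uint32_t addr)
    {
        return g_dla_regs[addr / sizeof(uint32_t)];
    }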

Read address

int32_t dla_read_dma_address(struct dla_task_desc *task_desc, int16_t index, void *dst)

Read a DMA address from the address list at the specified index. This function is used by functional block programming operations to read addresses for the DMA engines in the functional blocks.

Parameters:
  • task_desc – Task descriptor for in execution task
  • index – Index in address list
  • dst – Destination pointer to update address
Returns:

0 in case success, error code in case of failure

int32_t dla_read_cpu_address(struct dla_task_desc *task_desc, int16_t index, void *dst)

Read a CPU-accessible address from the address list at the specified index. This function is used by the engine scheduler to read data from a memory buffer. The address returned by this function must be accessible by the processor running the engine scheduler.

Parameters:
  • task_desc – Task descriptor for in execution task
  • index – Index in address list
  • dst – Destination pointer to update address
Returns:

0 in case success, error code in case of failure

Data read/write

int32_t dla_data_read(uint64_t src, void* dst, uint32_t size, uint32_t offset)

Read data from the src buffer to dst. Here src is a memory buffer shared by the UMD and dst is a local structure in the KMD.

Parameters:
  • src – Source address to read data from
  • dst – Destination to write data to
  • size – Size of data to read
  • offset – Offset from source address to read data
Returns:

0 in case success, error code in case of failure

int32_t dla_data_write(void* src, uint64_t dst, uint32_t size, uint32_t offset)

Write data from src to dst. Here src is a local structure in the KMD and dst is a memory buffer shared by the UMD.

Parameters:
  • src – Source to read data from
  • dst – Destination address to write data
  • size – Size of data to write
  • offset – Offset from destination address to write data
Returns:

0 in case success, error code in case of failure
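
On a system where the processor running the engine scheduler can address the UMD-shared buffers directly, both helpers reduce to a memcpy at the given offset. The sketch below assumes that flat-addressing case; a system with an MMU or SMMU between the two would need a mapping step instead.

    #include <cstdint>
    #include <cstring>

    // Flat-address sketch: treat the 64-bit buffer address as directly dereferenceable.
    int32_t dla_data_read(uint64_t src, void *dst, uint32_t size, uint32_t offset)
    {
        const void *from = reinterpret_cast<const void *>(
            static_cast<uintptr_t>(src + offset));
        std::memcpy(dst, from, size);
        return 0;
    }

    int32_t dla_data_write(void *src, uint64_t dst, uint32_t size, uint32_t offset)
    {
        void *to = reinterpret_cast<void *>(static_cast<uintptr_t>(dst + offset));
        std::memcpy(to, src, size);
        return 0;
    }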