..  SPDX-License-Identifier: BSD-3-Clause
    Copyright (c) 2022 Marvell.

Marvell cnxk Machine Learning Poll Mode Driver
==============================================

The cnxk ML poll mode driver provides support for offloading
Machine Learning inference operations to Machine Learning accelerator units
on the **Marvell OCTEON cnxk** SoC family.

The cnxk ML PMD code is organized into multiple files with all file names
starting with cn10k, providing support for CN106XX and CN106XXS.

More information about OCTEON cnxk SoCs may be obtained from `<https://www.marvell.com>`_

Supported OCTEON cnxk SoCs
--------------------------

- CN106XX
- CN106XXS

Features
--------

The OCTEON cnxk ML PMD provides support for the following set of operations:

Slow-path device and ML model handling:

* Device probing, configuration and close
* Device start and stop
* Model loading and unloading
* Model start and stop
* Data quantization and dequantization

Fast-path Inference:

* Inference execution
* Error handling


Compilation Prerequisites
-------------------------

This driver requires external libraries
to optionally enable support for models compiled using the Apache TVM framework.
The following dependencies are not part of DPDK and must be installed separately:

Jansson
~~~~~~~

This library enables support to parse and read JSON files.

DLPack
~~~~~~

This library provides headers for open in-memory tensor structures.

.. note::

   DPDK CNXK ML driver requires DLPack version 0.7

.. code-block:: console

   git clone https://github.com/dmlc/dlpack.git
   cd dlpack
   git checkout v0.7 -b v0.7
   cmake -S ./ -B build \
      -DCMAKE_INSTALL_PREFIX=<install_prefix> \
      -DBUILD_MOCK=OFF
   make -C build
   make -C build install

When cross-compiling, the compiler must be provided to CMake:

.. code-block:: console

   -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
   -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++

DMLC
~~~~

This is a library of common building blocks for scalable
and portable distributed machine learning.

.. code-block:: console

   git clone https://github.com/dmlc/dmlc-core.git
   cd dmlc-core
   git checkout main
   cmake -S ./ -B build \
      -DCMAKE_INSTALL_PREFIX=<install_prefix> \
      -DCMAKE_C_FLAGS="-fpermissive" \
      -DCMAKE_CXX_FLAGS="-fpermissive" \
      -DUSE_OPENMP=OFF
   make -C build
   make -C build install

When cross-compiling, the compiler must be provided to CMake:

.. code-block:: console

   -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
   -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++

TVM
~~~

Apache TVM provides runtime libraries used to execute models
on CPU cores or hardware accelerators.

.. note::

   DPDK CNXK ML driver requires TVM version 0.10.0

.. code-block:: console

   git clone https://github.com/apache/tvm.git
   cd tvm
   git checkout v0.10.0 -b v0.10.0
   git submodule update --init
   cmake -S ./ -B build \
      -DCMAKE_INSTALL_PREFIX=<install_prefix> \
      -DBUILD_STATIC_RUNTIME=OFF
   make -C build
   make -C build install

When cross-compiling, more options must be provided to CMake:

.. code-block:: console

   -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
   -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
   -DMACHINE_NAME=aarch64-linux-gnu \
   -DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
   -DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY

TVMDP
~~~~~

Marvell's `TVM Dataplane Library <https://github.com/MarvellEmbeddedProcessors/tvmdp>`_
works as an interface between TVM runtime and DPDK drivers.
The TVMDP library provides a simplified C interface
for TVM's C++-based runtime.

.. note::

   The TVMDP library is dependent on the TVM, dlpack, jansson and dmlc-core libraries.

.. code-block:: console

   git clone https://github.com/MarvellEmbeddedProcessors/tvmdp.git
   cd tvmdp
   git checkout main
   cmake -S ./ -B build \
      -DCMAKE_INSTALL_PREFIX=<install_prefix> \
      -DBUILD_SHARED_LIBS=ON
   make -C build
   make -C build install

When cross-compiling, more options must be provided to CMake:

.. code-block:: console

   -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
   -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
   -DCMAKE_FIND_ROOT_PATH=<install_prefix>

libarchive
~~~~~~~~~~

The Apache TVM framework generates compiled models as tar archives.
This library enables support to decompress and read archive files
in tar, xz and other formats.
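
Jansson and libarchive can typically be installed from distribution packages
rather than built from source.
As an illustration only (the package names below are an assumption
and vary by distribution), on a Debian/Ubuntu based system:

.. code-block:: console

   # Package names are distribution-specific and shown here as an example only.
   apt-get install libjansson-dev libarchive-dev

When cross-compiling, the aarch64 builds of these libraries
must be made available to the cross toolchain instead.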


Installation
------------

The OCTEON cnxk ML PMD may be compiled natively on an OCTEON cnxk platform
or cross-compiled on an x86 platform.

In order for Meson to find the dependencies above during the configure stage,
it is required to update environment variables as below:

.. code-block:: console

   CMAKE_PREFIX_PATH='<install_prefix>/lib/cmake/tvm:<install_prefix>/lib/cmake/dlpack:<install_prefix>/lib/cmake/dmlc'
   PKG_CONFIG_PATH='<install_prefix>/lib/pkgconfig'

Refer to :doc:`../platform/cnxk` for instructions to build your DPDK application.
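
As a minimal illustration, a cross build using the dependencies installed
under ``<install_prefix>`` might be configured as below.
The cross file name is an assumption based on the platform guide
and may differ for your toolchain.

.. code-block:: console

   export CMAKE_PREFIX_PATH='<install_prefix>/lib/cmake/tvm:<install_prefix>/lib/cmake/dlpack:<install_prefix>/lib/cmake/dmlc'
   export PKG_CONFIG_PATH='<install_prefix>/lib/pkgconfig'
   meson setup build --cross-file config/arm/arm64_cn10k_linux_gcc
   ninja -C build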


Initialization
--------------

List the ML PF devices available on the cn10k platform:

.. code-block:: console

   lspci -d:a092

``a092`` is the ML device PF id. You should see output similar to:

.. code-block:: console

   0000:00:10.0 System peripheral: Cavium, Inc. Device a092

Bind the ML PF device to the vfio-pci driver:

.. code-block:: console

   cd <dpdk directory>
   usertools/dpdk-devbind.py -u 0000:00:10.0
   usertools/dpdk-devbind.py -b vfio-pci 0000:00:10.0
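
To confirm the binding, the device status can be listed.
This is an optional check; the exact output format varies across DPDK versions.

.. code-block:: console

   usertools/dpdk-devbind.py --status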


VDEV support
------------

On platforms which don't support ML hardware acceleration through a PCI device,
the Marvell ML CNXK PMD can execute inference operations on a vdev,
using ML models compiled with the Apache TVM framework.

VDEV can be enabled by passing the following EAL argument:

.. code-block:: console

   --vdev ml_mvtvm

VDEV can also be used on platforms with an ML HW accelerator.
However, to use the vdev in this case, the PCI device has to be unbound.
When the PCI device is bound, creation of the vdev is skipped.


Runtime Config Options
----------------------

**Firmware file path** (default ``/lib/firmware/mlip-fw.bin``)

  Path to the firmware binary to be loaded during device configuration.
  The parameter ``fw_path`` can be used by the user
  to load ML firmware from a custom path.

  This option is supported only on PCI HW accelerator.

  For example::

     -a 0000:00:10.0,fw_path="/home/user/ml_fw.bin"

  With the above configuration, the driver loads the firmware from the path
  ``/home/user/ml_fw.bin``.


**Enable DPE warnings** (default ``1``)

  ML firmware can be configured during load to handle the DPE errors reported
  by the ML inference engine.
  When enabled, the firmware masks DPE non-fatal hardware errors as warnings.
  The parameter ``enable_dpe_warnings`` is used for this configuration.

  This option is supported only on PCI HW accelerator.

  For example::

     -a 0000:00:10.0,enable_dpe_warnings=0

  With the above configuration, DPE non-fatal errors reported by HW
  are considered as errors.


**Model data caching** (default ``1``)

  Enable caching model data on ML ACC cores.
  Enabling this option executes a dummy inference request
  in synchronous mode during the model start stage.
  Caching of model data improves the inference throughput / latency for the model.
  The parameter ``cache_model_data`` is used to enable data caching.

  This option is supported on PCI HW accelerator and vdev.

  For example::

     -a 0000:00:10.0,cache_model_data=0

  With the above configuration, model data caching is disabled on HW accelerator.

  For example::

     --vdev ml_mvtvm,cache_model_data=0

  With the above configuration, model data caching is disabled on vdev.


**OCM allocation mode** (default ``lowest``)

  Option to specify the method to be used while allocating OCM memory
  for a model during model start.
  Two modes are supported by the driver.
  The parameter ``ocm_alloc_mode`` is used to select the OCM allocation mode.

  ``lowest``
    Allocate OCM for the model from the first available free slot.
    Search for the free slot is done starting from the lowest tile ID and lowest page ID.

  ``largest``
    Allocate OCM for the model from the slot with the largest amount of free space.

  This option is supported only on PCI HW accelerator.

  For example::

     -a 0000:00:10.0,ocm_alloc_mode=lowest

  With the above configuration, OCM allocation for the model would be done
  from the first available free slot / from the lowest possible tile ID.


**OCM page size** (default ``16384``)

  Option to specify the page size in bytes to be used for OCM management.
  Available OCM is split into multiple pages of the specified size
  and the pages are allocated to the models.
  The parameter ``ocm_page_size`` is used to specify the page size to be used.

  Page sizes supported by the driver are 1 KB, 2 KB, 4 KB, 8 KB and 16 KB.
  Default page size is 16 KB.

  This option is supported only on PCI HW accelerator.

  For example::

     -a 0000:00:10.0,ocm_page_size=8192

  With the above configuration, page size of OCM is set to 8192 bytes / 8 KB.


**Enable hardware queue lock** (default ``0``)

  Option to select the job request enqueue function used
  to queue the requests to the hardware queue.
  The parameter ``hw_queue_lock`` is used to select the enqueue function.

  ``0``
    Disable (default), use the lock-free version of the hardware enqueue function
    for job queuing in the enqueue burst operation.
    To avoid race conditions in request queuing to hardware,
    disabling ``hw_queue_lock`` restricts the number of queue-pairs
    supported by the cnxk driver to 1.

  ``1``
    Enable, use the spin-lock version of the hardware enqueue function for job queuing.
    Enabling the spin-lock version removes the restriction on the number of queue-pairs
    that can be supported by the driver.

  This option is supported only on PCI HW accelerator.

  For example::

     -a 0000:00:10.0,hw_queue_lock=1

  With the above configuration, the spin-lock version of the hardware enqueue function
  is used in the fast path enqueue burst operation.


**Maximum queue pairs** (default ``1``)

  VDEV supports additional EAL arguments to configure the maximum number
  of queue-pairs on the ML device through the option ``max_qps``.

  This option is supported only on vdev.

  For example::

     --vdev ml_mvtvm,max_qps=4

  With the above configuration, 4 queue-pairs are created on the vdev.
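
Multiple runtime options for the same device can be combined
in a single, comma-separated device argument string.
The combination below is only an illustration; the values are arbitrary.

For example::

   -a 0000:00:10.0,fw_path="/home/user/ml_fw.bin",ocm_page_size=8192,hw_queue_lock=1

With the above configuration, the firmware is loaded from a custom path,
the OCM page size is set to 8 KB
and the spin-lock version of the enqueue function is used.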


Debugging Options
-----------------

.. _table_octeon_cnxk_ml_debug_options:

.. table:: OCTEON cnxk ML PMD debug options

   +---+------------+-------------------------------------------+
   | # | Component  | EAL log command                           |
   +===+============+===========================================+
   | 1 | ML         | --log-level='pmd\.common\.cnxk\.ml,8'     |
   +---+------------+-------------------------------------------+


Extended stats
--------------

Marvell cnxk ML PMD supports reporting the device and model extended statistics.

PMD supports the below list of 4 device extended stats.

.. _table_octeon_cnxk_ml_device_xstats_names:

.. table:: OCTEON cnxk ML PMD device xstats names

   +---+---------------------+----------------------------------------------+
   | # | Type                | Description                                  |
   +===+=====================+==============================================+
   | 1 | nb_models_loaded    | Number of models loaded                      |
   +---+---------------------+----------------------------------------------+
   | 2 | nb_models_unloaded  | Number of models unloaded                    |
   +---+---------------------+----------------------------------------------+
   | 3 | nb_models_started   | Number of models started                     |
   +---+---------------------+----------------------------------------------+
   | 4 | nb_models_stopped   | Number of models stopped                     |
   +---+---------------------+----------------------------------------------+


PMD supports the below list of 6 extended stats types per model.

.. _table_octeon_cnxk_ml_model_xstats_names:

.. table:: OCTEON cnxk ML PMD model xstats names

   +---+---------------------+----------------------------------------------+
   | # | Type                | Description                                  |
   +===+=====================+==============================================+
   | 1 | Avg-HW-Latency      | Average hardware latency                     |
   +---+---------------------+----------------------------------------------+
   | 2 | Min-HW-Latency      | Minimum hardware latency                     |
   +---+---------------------+----------------------------------------------+
   | 3 | Max-HW-Latency      | Maximum hardware latency                     |
   +---+---------------------+----------------------------------------------+
   | 4 | Avg-FW-Latency      | Average firmware latency                     |
   +---+---------------------+----------------------------------------------+
   | 5 | Min-FW-Latency      | Minimum firmware latency                     |
   +---+---------------------+----------------------------------------------+
   | 6 | Max-FW-Latency      | Maximum firmware latency                     |
   +---+---------------------+----------------------------------------------+

Latency values reported by the PMD through xstats are in units
of either cycles or nanoseconds.
The latency unit is determined during DPDK initialization
and depends on the availability of SCLK.
Latencies are reported in nanoseconds when the SCLK is available and in cycles otherwise.
The application needs to initialize at least one RVU for the clock to be available.

xstats names are dynamically generated by the PMD and have the format
``Model-<model_id>-Type-<units>``.

For example::

   Model-1-Avg-FW-Latency-ns

The above xstat name reports the average firmware latency in nanoseconds
for model ID 1.

The number of xstats made available by the PMD changes dynamically.
The number increases when a model is loaded and decreases when a model is unloaded.
The application needs to update the xstats map after a model is either loaded or unloaded.