Starting with Android 15 (Android V), Google Is Deprecating NNAPI in Favor of LiteRT (TensorFlow Lite)

Preface

A few years ago I wrote an article on how to use NNAPI to run AI models on a device.

Further reading: Neural Networks API Introduction: How Does an Android App Run a Neural Network Model Behind the Scenes?

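In its latest NNAPI documentation, Google announced that NNAPI will be deprecated with the release of Android 15 (V) and that LiteRT (TensorFlow Lite) is the recommended path going forward. This post summarizes the relevant material and explains why Google is making this change.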

NNAPI will be Deprecated with the release of Android 15 (Android-V)

Google's latest NNAPI documentation states:

Warning: With the release of Android 15, NNAPI will be deprecated. While you can continue to use NNAPI, we expect the majority of devices in the future to use the CPU backend, and therefore for performance critical workloads, we recommend migrating to alternative solutions, for example the TF Lite GPU runtime.

For more information, see the NNAPI Migration Guide.

The NNAPI Migration Guide explains the reasons:

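Since NNAPI was released, on-device machine learning (ODML) has evolved rapidly. With models such as transformers and diffusion models advancing quickly, developers need to update their underlying infrastructure and development tools frequently.
To meet these needs, Google provides TensorFlow Lite (LiteRT) in Play Services, which gives these AI models an updatable TensorFlow Lite runtime.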

The guide also states that NNAPI is marked as deprecated from Android 15 (Android V) onward, and it recommends that developers adopt TensorFlow Lite (LiteRT) in Play Services instead.

LiteRT = TensorFlow Lite, Renamed

LiteRT (short for Lite Runtime) is simply the TensorFlow Lite everyone already knows, under a new name.

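LiteRT, formerly TensorFlow Lite, is Google's runtime designed for on-device AI. Models from common training frameworks (TensorFlow, PyTorch, JAX, and others) can be converted to the .tflite format and executed.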

Its key features are:

Optimized for on-device machine learning: tuned around the main ODML constraints
1. Latency: no round trip to a server
2. Privacy: data never leaves the device
3. Connectivity: no network connection required
4. Size: reduced model and binary size
5. Power consumption: more efficient inference
Multi-platform support: compatible with Android, iOS, embedded Linux, and microcontrollers (MCUs)
Multi-framework model options: converts models from multiple frameworks into the FlatBuffers format (.tflite)
Diverse language support: SDKs for Java/Kotlin, Swift, Objective-C, C++, and Python
High performance: hardware delegates let AI models run on a GPU or NPU

Hardware Acceleration of LiteRT

LiteRT is mainly used on edge devices. So how does it achieve acceleration on platforms that carry dedicated hardware for speeding up AI computation?

The documentation has a section on hardware acceleration that covers several topics. A brief summary follows.

LiteRT Delegates

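A delegate in LiteRT is the mechanism that lets the platform's AI acceleration hardware (GPU, TPU, NPU, and so on) be used to speed up AI computation. Using that hardware efficiently on an edge device is tricky for engineers, however, for several reasons:

Although LiteRT's default CPU kernels are accelerated with the Arm NEON instruction set, a CPU is fundamentally not the hardware best suited to AI workloads.
GPUs and NPUs are well suited to accelerating AI computation because of their hardware characteristics, but you may have to write kernels tailored to that hardware (e.g. OpenCL / OpenGL for a GPU).
Different hardware has different characteristics, and not every OP suits the same hardware, so deciding how to dispatch work across the available resources becomes a hard problem.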

TensorFlow Lite's Delegate API helps engineers simplify these problems by acting as a bridge between the TFLite runtime and the lower-level APIs.

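If you are an AI developer: you do not need to care about the characteristics of the underlying hardware. You write your AI application against LiteRT, and if the vendor supplies a matching hardware delegate, you get that hardware's acceleration.
If you are a hardware vendor: you have your own NPU. You can use the delegate API to implement OP kernels that fit your hardware, then plug them into LiteRT so developers can use your hardware.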
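The split between the two roles can be sketched with a toy model. This is a conceptual illustration only, not the LiteRT Delegate API: `ToyNpuDelegate`, `ToyRuntime`, `CPU_KERNELS`, and the op names `ADD_ONE` and `DOUBLE` are all hypothetical names invented for this example. The vendor side declares which ops it can run, and the developer side calls the runtime the same way whether or not a delegate is attached.

```python
# Toy model of the developer / vendor split. Every name here is
# hypothetical; this is not the real LiteRT Delegate API.

# Reference CPU kernels that the runtime can always fall back to.
CPU_KERNELS = {
    "ADD_ONE": lambda x: x + 1,
    "DOUBLE": lambda x: x * 2,
}


class ToyNpuDelegate:
    """Vendor side: kernels for the ops this (imaginary) NPU supports."""

    def __init__(self):
        self._kernels = {"DOUBLE": lambda x: x * 2}

    def supports(self, op):
        return op in self._kernels

    def run(self, op, x):
        return self._kernels[op](x)


class ToyRuntime:
    """Developer side: the same invoke() call with or without a delegate."""

    def __init__(self, delegate=None):
        self.delegate = delegate
        self.trace = []  # records where each op actually ran

    def invoke(self, ops, x):
        for op in ops:
            if self.delegate is not None and self.delegate.supports(op):
                x = self.delegate.run(op, x)
                self.trace.append((op, "delegate"))
            else:
                x = CPU_KERNELS[op](x)
                self.trace.append((op, "cpu"))
        return x


model = ["ADD_ONE", "DOUBLE"]

cpu_only = ToyRuntime()
accelerated = ToyRuntime(delegate=ToyNpuDelegate())

cpu_result = cpu_only.invoke(model, 3)
accelerated_result = accelerated.invoke(model, 3)
```

In this sketch both runtimes produce the same result; only the trace differs, with `DOUBLE` handled by the delegate and `ADD_ONE` falling back to the CPU kernel.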

(At the moment the GPU delegate is the only one with comprehensive coverage on the LiteRT site, but on Android the Qualcomm® AI Engine Direct Delegate is also ready. It is covered briefly later.)

How to Implement Your Hardware Delegate?

The documentation has a page on implementing your own delegate. But when would you need to create a custom delegate?

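You want to integrate a new ML inference engine that no existing delegate supports.
You have a custom hardware accelerator (for example, you are a hardware vendor) that improves runtime for known scenarios.
You are developing CPU optimizations (such as OP fusion) that can speed up certain models.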

As the figure below illustrates, a delegate lets you dispatch part of the AI computing graph (a sub-graph) to a specific accelerator for acceleration.

How do delegates work?

Suppose you start with a computing graph like the following:

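Now suppose you have a hardware accelerator (or you want to do CPU OP fusion) that can process these two OPs together. You can turn the two OPs into a single delegate node that handles both, and the computing graph then becomes: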

For more detail, see the example in the documentation, which walks through how to do this.
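The node replacement described in this section can be sketched in a few lines. This is a simplified illustration under stated assumptions, not LiteRT's partitioning code: the graph is treated as a linear list of nodes in execution order (a real graph is a DAG), and `Node`, `partition`, and the node and op names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    op: str


def partition(nodes, supported_ops):
    """Merge each run of consecutive delegate-supported nodes into one
    delegate node; everything else stays on the CPU.

    Assumes `nodes` is a linear graph listed in execution order.
    """
    plan = []
    run = []  # supported nodes collected so far for the current delegate node
    for node in nodes:
        if node.op in supported_ops:
            run.append(node.name)
            continue
        if run:
            plan.append(("DELEGATE", run))
            run = []
        plan.append(("CPU", [node.name]))
    if run:
        plan.append(("DELEGATE", run))
    return plan


# Hypothetical three-node graph: the accelerator handles the first two ops.
graph = [
    Node("conv1", "CONV_2D"),
    Node("mean1", "MEAN"),
    Node("custom1", "MY_CUSTOM_OP"),
]
plan = partition(graph, supported_ops={"CONV_2D", "MEAN"})
```

Here `plan` holds one delegate node covering `conv1` and `mean1`, followed by `custom1` on the CPU, which mirrors the idea of replacing two OPs with a single delegate node.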

Up to this point, the documentation covers how to connect LiteRT to the hardware's low-level implementation. How the hardware implements each scenario and OP is the hardware vendor's job.

Hardware Acceleration on Android

Acceleration service

On Android, LiteRT provides the Acceleration Service for Android API. It first runs internal inference benchmarks with your LiteRT model, which take on the order of seconds.

Prepare the model, data samples, and the golden (expected) outputs, and the Acceleration Service runs the benchmarks and recommends which hardware to use. You can then run inference on that hardware.

According to the documentation, only GPU and CPU are currently supported; other hardware is planned for the future.
1. GpuAccelerationConfig: converted to the GPU delegate at execution time
2. CpuAccelerationConfig
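The validate-then-choose idea behind the service can be sketched as follows. This is not the Acceleration Service API, which is an Android API; it is a toy Python model of the same flow, and `validate_and_pick`, the `fake_*` backends, and the tolerance value are all hypothetical. Each candidate runs the sample, candidates whose output deviates from the golden output are rejected, and the fastest remaining one is chosen.

```python
import time


def validate_and_pick(candidates, sample, golden, tolerance=1e-3):
    """Return the name of the fastest candidate whose output matches golden.

    `candidates` maps a configuration name to a callable that runs inference.
    Returns None if no candidate passes the accuracy check.
    """
    best_name, best_latency = None, float("inf")
    for name, run in candidates.items():
        start = time.perf_counter()
        output = run(sample)
        latency = time.perf_counter() - start
        accurate = len(output) == len(golden) and all(
            abs(o - g) <= tolerance for o, g in zip(output, golden)
        )
        if accurate and latency < best_latency:
            best_name, best_latency = name, latency
    return best_name


# Hypothetical stand-ins for real backends; the "model" just doubles its input.
def fake_cpu(xs):
    time.sleep(0.05)  # pretend the CPU path is slower
    return [2 * x for x in xs]


def fake_gpu(xs):
    return [2 * x for x in xs]


def fake_broken_accelerator(xs):
    return [2 * x + 1 for x in xs]  # fast but numerically wrong


candidates = {
    "cpu": fake_cpu,
    "gpu": fake_gpu,
    "broken_accelerator": fake_broken_accelerator,
}
choice = validate_and_pick(candidates, sample=[1.0, 2.0, 3.0], golden=[2.0, 4.0, 6.0])
```

Checking accuracy before comparing latency matters here: a backend that is fast but produces wrong results should never be selected.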

GPU delegates

There are currently two ways to use the GPU delegate.

If you are interested, the documentation covers them in more detail.

NPU delegates

According to the documentation, Qualcomm's AI Engine Direct Delegate is currently the only NPU delegate integrated with LiteRT.

Qualcomm's AI Engine Direct SDK is a newly introduced SDK that provides, for Qualcomm hardware, a unified API and modular, extensible per-accelerator libraries, which form a reusable basis for full-stack AI solutions.

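Note that it is not the same thing as Qualcomm's QNPS (Qualcomm Neural Processing SDK). QNPS produces a portable .dlc binary file used to run AI workloads on Qualcomm platforms, whereas the AI Engine Direct SDK wraps QNPS and other development interfaces to offer a more complete AI solution. The figure below shows the full software stack of the AI Engine Direct SDK.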

Other vendors are expected to add support in the future (coming soon).
