Design of Energy-Efficient CNN Accelerator
喇超,李淼,张峰,张翠婷
Abstract:
Convolutional neural networks (CNNs) are now widely used in image classification, object detection and recognition, and natural language understanding. As network complexity and scale keep growing, hardware deployment becomes increasingly difficult, especially under the low-power, low-latency requirements of embedded applications, where most existing platforms suffer from high power consumption and complex control. To address this, with accelerator energy efficiency as the optimization goal, the key factors that determine system energy efficiency are analyzed. Starting from scaling down compute precision and lowering the system clock frequency, an extremely low-bit, network-wide uniform quantization method is studied, and an energy-efficient CNN accelerator, MSNAP, is designed. Built on lightweight compute units that operate on 1-bit weights and 4-bit activations, the accelerator forms a 128×128 spatially parallel acceleration array; owing to this high spatial parallelism, the whole system can run at a low clock frequency. A weight-stationary, feature-map-broadcast dataflow is adopted, which effectively reduces the number of data movements for weights and feature maps, thereby lowering power consumption and improving system energy efficiency. The design was verified through tape-out in a 22 nm process. At 20 MHz, the peak throughput reaches 10.54 TOPS and the energy efficiency reaches 64.317 TOPS/W; on a classification network using the CIFAR-10 dataset, this is a 5× improvement in energy efficiency over accelerators of the same type. The deployed YOLO object-detection network runs at 60 FPS, fully meeting the requirements of embedded applications.
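The abstract's energy-efficiency argument rests on two levers: multiplier-free arithmetic (1-bit weights against 4-bit activations) and massive spatial parallelism at a low clock. The Python sketch below is a minimal illustration, not the authors' design: it assumes the common {-1, +1} binary-weight convention, and MACS_PER_NCU is an illustrative value chosen only to show how the reported 10.54 TOPS relates to the 128×128 array and the 20 MHz clock, together with the core power implied by 64.317 TOPS/W.

```python
# Minimal sketch (not the MSNAP RTL): why a 1-bit-weight / 4-bit-activation
# compute unit is cheap, plus a back-of-the-envelope check of the reported figures.
# Assumptions (not stated in the abstract): weights follow the common {-1, +1}
# binary convention, and MACS_PER_NCU is an illustrative value only.

def ncu_dot(acts_4bit, weights_1bit):
    """Dot product with 1-bit weights: every multiply degenerates into an
    add or subtract of the 4-bit activation, so no hardware multiplier is needed."""
    acc = 0
    for a, w in zip(acts_4bit, weights_1bit):
        assert 0 <= a <= 15 and w in (-1, +1)
        acc += a if w == +1 else -a
    return acc

# Back-of-the-envelope check against the reported numbers.
ARRAY = 128 * 128           # spatially parallel NCU array
FREQ_HZ = 20e6              # 20 MHz system clock
MACS_PER_NCU = 16           # assumed MACs per NCU per cycle (illustrative)

peak_tops = ARRAY * MACS_PER_NCU * 2 * FREQ_HZ / 1e12   # 1 MAC counted as 2 ops
implied_power_w = 10.54 / 64.317                         # TOPS / (TOPS/W)

print(f"peak ~{peak_tops:.2f} TOPS, implied core power ~{implied_power_w * 1e3:.0f} mW")
# -> peak ~10.49 TOPS (close to the reported 10.54 TOPS),
#    implied core power ~164 mW at the reported 64.317 TOPS/W
```

The weight-stationary, feature-map-broadcast dataflow described in the abstract complements this arithmetic: weights stay pinned inside each NCU while activations are broadcast across the array, so the cost of moving each operand is amortized over the full 128×128 array.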
Keywords: accelerator; convolutional neural network (CNN); lightweight neuron compute unit (NCU); MSNAP; branch convolution quantization (BCQ)
Foundation:
Authors: 喇超, 李淼, 张峰, 张翠婷