Design of Energy-Efficient CNN Accelerator
喇超,李淼,张峰,张翠婷
Abstract:
Convolutional neural networks (CNNs) are now widely used in image classification, object detection and recognition, and natural language understanding. As network complexity and scale keep growing, hardware deployment becomes increasingly difficult, especially under the low-power, low-latency requirements of embedded applications, where most existing platforms suffer from high power consumption and complex control. To address this, with accelerator energy efficiency as the optimization goal, the key factors that determine system energy efficiency are analyzed. Starting from scaling down compute precision and lowering the system clock frequency, an extremely low-bit, network-wide uniform quantization method is studied, and an energy-efficient CNN accelerator, MSNAP, is designed. Built on lightweight compute units that operate on 1-bit weights and 4-bit activations, the accelerator forms a 128×128 spatially parallel acceleration array; owing to this high spatial parallelism, the whole system can run at a low clock frequency. A weight-stationary, feature-map-broadcast dataflow is adopted, which effectively reduces the number of data movements for weights and feature maps, thereby lowering power consumption and improving system energy efficiency. The design was verified through tape-out in a 22 nm process. At 20 MHz, the peak throughput reaches 10.54 TOPS and the energy efficiency reaches 64.317 TOPS/W; on a classification network using the CIFAR-10 dataset, this is a 5× improvement in energy efficiency over accelerators of the same type. The deployed YOLO object-detection network runs at 60 FPS, fully meeting the requirements of embedded applications.
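The abstract's energy-efficiency argument rests on two levers: multiplier-free arithmetic (1-bit weights against 4-bit activations) and massive spatial parallelism at a low clock. The Python sketch below is a minimal illustration, not the authors' design: it assumes the common {-1, +1} binary-weight convention, and MACS_PER_NCU is an illustrative value chosen only to show how the reported 10.54 TOPS relates to the 128×128 array and the 20 MHz clock, together with the core power implied by 64.317 TOPS/W.

```python
# Minimal sketch (not the MSNAP RTL): why a 1-bit-weight / 4-bit-activation
# compute unit is cheap, plus a back-of-the-envelope check of the reported figures.
# Assumptions (not stated in the abstract): weights follow the common {-1, +1}
# binary convention, and MACS_PER_NCU is an illustrative value only.

def ncu_dot(acts_4bit, weights_1bit):
    """Dot product with 1-bit weights: every multiply degenerates into an
    add or subtract of the 4-bit activation, so no hardware multiplier is needed."""
    acc = 0
    for a, w in zip(acts_4bit, weights_1bit):
        assert 0 <= a <= 15 and w in (-1, +1)
        acc += a if w == +1 else -a
    return acc

# Back-of-the-envelope check against the reported numbers.
ARRAY = 128 * 128           # spatially parallel NCU array
FREQ_HZ = 20e6              # 20 MHz system clock
MACS_PER_NCU = 16           # assumed MACs per NCU per cycle (illustrative)

peak_tops = ARRAY * MACS_PER_NCU * 2 * FREQ_HZ / 1e12   # 1 MAC counted as 2 ops
implied_power_w = 10.54 / 64.317                         # TOPS / (TOPS/W)

print(f"peak ~{peak_tops:.2f} TOPS, implied core power ~{implied_power_w * 1e3:.0f} mW")
# -> peak ~10.49 TOPS (close to the reported 10.54 TOPS),
#    implied core power ~164 mW at the reported 64.317 TOPS/W
```

The weight-stationary, feature-map-broadcast dataflow described in the abstract complements this arithmetic: weights stay pinned inside each NCU while activations are broadcast across the array, so the cost of moving each operand is amortized over the full 128×128 array.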
Keywords: accelerator; convolutional neural network (CNN); lightweight neuron compute unit (NCU); MSNAP; branch convolution quantization (BCQ)
Foundation:
Authors: 喇超, 李淼, 张峰, 张翠婷