
CLC number: TP391
On-line Access: 2025-11-17
Received: 2024-11-11
Revision Accepted: 2025-11-18
Crosschecked: 2025-07-08
Zheyang LI, Chaoxiang LAN, Kai ZHANG, Wenming TAN, Ye REN, Jun XIAO. An adaptive outlier correction quantization method for vision Transformers[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400994
An adaptive outlier correction quantization method for vision Transformers
1 College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
2 Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou 310051, China
Abstract: Although Transformer models have achieved remarkable results in many fields, their enormous computational and memory requirements limit their application, especially when they are deployed on resource-constrained edge devices. Quantization, an effective model compression method, can significantly reduce the runtime of Transformers on edge devices. Notably, compared with convolutional neural networks (CNNs), the activations of Transformers exhibit far more pronounced outliers, which leads to uneven feature distributions across channels and tokens. To address this problem, we propose an adaptive outlier correction quantization (AOCQ) method that markedly reduces the adverse impact of these outliers. AOCQ adjusts the notable discrepancies across channels and tokens at three levels: the operator level, the framework level, and the loss level. We introduce a novel operator that equivalently balances the activations across channels, and add an extra stage at the framework level to optimize the quantization step of the activations. Furthermore, at the loss level, the imbalanced activations across tokens and channels are transferred into the optimization of the model weights. Theoretical analysis shows that the proposed method effectively reduces the quantization error. Its effectiveness is verified on various benchmark models and tasks. With 8-bit post-training quantization, DeiT-B achieves 81.57% accuracy with only a 0.28 percentage point drop while enjoying 4x faster inference. Moreover, on several tasks including image classification and object detection, the weights of Swin Transformer and DeiT can be post-training quantized to 4 bits with only 2% accuracy loss while requiring 8x less memory.
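To make the operator-level idea more concrete, the sketch below shows one common way such channel balancing can be realized: activation outliers are damped by a per-channel scale that is folded into the following linear layer, so the full-precision output is unchanged while the activations become easier to quantize. This is a minimal illustration under our own assumptions (scale-migration style balancing, with hypothetical names such as balance_channels and alpha); it is not the authors' exact AOCQ operator, nor their framework- or loss-level procedures.

# Minimal sketch (not the authors' exact AOCQ operator): per-channel scale
# migration that moves activation outliers into the following weight matrix
# before uniform quantization. All names here are illustrative.
import torch

def balance_channels(x, weight, alpha=0.5, eps=1e-8):
    """Rescale activations channel-wise and fold the inverse scale into the
    next layer's weight so that x @ weight.T is unchanged.

    x:      (tokens, in_channels) calibration activations
    weight: (out_channels, in_channels) weight of the following linear layer
    alpha:  migration strength; 0 keeps x as-is, 1 fully flattens channels
    """
    ch_max = x.abs().amax(dim=0).clamp(min=eps)   # per-channel outlier magnitude
    scale = ch_max.pow(alpha)                     # partial migration factor
    x_bal = x / scale                             # activations with damped outliers
    w_bal = weight * scale                        # compensate in the weights
    return x_bal, w_bal

def quantize_sym(t, n_bits=8):
    """Plain symmetric uniform quantization, used only to compare errors."""
    qmax = 2 ** (n_bits - 1) - 1
    step = t.abs().max() / qmax
    return torch.clamp(torch.round(t / step), -qmax, qmax) * step

# Usage: the balanced pair should quantize with a lower activation error,
# while the full-precision product stays numerically identical.
x = torch.randn(128, 64)
x[:, 3] *= 30.0                                   # inject a channel outlier
w = torch.randn(256, 64)
xb, wb = balance_channels(x, w)
assert torch.allclose(x @ w.T, xb @ wb.T, atol=1e-3)
err_plain = (quantize_sym(x) @ w.T - x @ w.T).abs().mean()
err_bal = (quantize_sym(xb) @ wb.T - xb @ wb.T).abs().mean()
print(f"mean error, plain: {err_plain.item():.4f}  balanced: {err_bal.item():.4f}")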

