
CLC number: TP391.4
On-line Access: 2026-03-02
Received: 2025-10-13
Revision Accepted: 2026-01-15
Crosschecked: 2026-03-02
Xichuan ZHOU, Sihuan ZHAO, Rui DING, Jiayu SHI, Jing NIE, Lihui CHEN, Haijun LIU. TP-ViT: truncated uniform-log2 quantizer and progressive bit-decline reconstruction for vision Transformer quantization[J]. Journal of Zhejiang University Science C, 2026, 27(1): 1-12.
@article{title="TP-ViT: truncated uniform-log2 quantizer and progressive bit-decline reconstruction for vision Transformer quantization",
author="Xichuan ZHOU, Sihuan ZHAO, Rui DING, Jiayu SHI, Jing NIE, Lihui CHEN, Haijun LIU",
journal="Journal of Zhejiang University Science C",
volume="27",
number="1",
pages="1-12",
year="2026",
publisher="Zhejiang University Press & Springer",
issn="1869-1951",
doi="10.1631/ENG.ITEE.2025.0081"
}
Abstract: Vision Transformers (ViTs) have achieved remarkable success across various artificial intelligence-based computer vision applications. However, their demanding computational and memory requirements pose significant challenges for deployment on resource-constrained edge devices. Although post-training quantization (PTQ) provides a promising solution by reducing model precision with minimal calibration data, aggressive low-bit quantization typically leads to substantial performance degradation. To address this challenge, we present the truncated uniform-log2 quantizer and progressive bit-decline reconstruction method for vision Transformer quantization (TP-ViT), an innovative PTQ framework specifically designed for ViTs. It features two key technical contributions: (1) the truncated uniform-log2 quantizer, a novel quantization approach that effectively handles outlier values in post-Softmax activations, significantly reducing quantization errors; (2) the bit-decline optimization strategy, which employs transition weights to gradually reduce bit precision while maintaining model performance under extreme quantization conditions. Comprehensive experiments on image classification, object detection, and instance segmentation tasks demonstrate TP-ViT's superior performance compared to state-of-the-art PTQ methods, particularly in challenging 3-bit quantization scenarios. Our framework achieves a notable improvement of 6.18 percentage points in top-1 accuracy for ViT-small under 3-bit quantization. These results validate TP-ViT's robustness and general applicability, paving the way for more efficient deployment of ViT models in computer vision applications on edge hardware.
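The abstract's two ideas can be illustrated with a small NumPy sketch. This is not the paper's exact formulation: the function names, the truncation threshold `tau`, and the linear blending of two bit-widths by a transition weight `alpha` are all illustrative assumptions. The sketch only shows the general pattern: log2 levels for the many near-zero post-Softmax values, uniform levels for the few large outliers, and a weighted transition from a higher to a lower bit-width.

```python
import numpy as np

def log2_quantize(x, bits, eps=1e-8):
    # Power-of-two quantization for values in (0, 1]: round -log2(x)
    # to an integer exponent in [0, 2**bits - 1], then map back to 2**-q.
    levels = 2 ** bits - 1
    q = np.clip(np.round(-np.log2(np.maximum(x, eps))), 0, levels)
    return 2.0 ** (-q)

def uniform_quantize(x, bits, lo, hi):
    # Plain uniform quantization of x over the clipped range [lo, hi].
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    q = np.round((np.clip(x, lo, hi) - lo) / scale)
    return q * scale + lo

def truncated_uniform_log2(x, bits, tau=0.25):
    # Illustrative truncated uniform-log2 quantizer (tau is a hypothetical
    # threshold): log2 levels for the many small post-Softmax values below
    # tau, uniform levels for the few large "outlier" values in [tau, 1].
    small = x < tau
    out = np.empty_like(x)
    out[small] = log2_quantize(x[small], bits)
    out[~small] = uniform_quantize(x[~small], bits, tau, 1.0)
    return out

def bit_decline(x, hi_bits, lo_bits, alpha):
    # Illustrative progressive bit decline: blend higher-bit and lower-bit
    # quantizations with a transition weight alpha that is swept from
    # 0 (all high-bit) to 1 (all low-bit) during reconstruction.
    return ((1 - alpha) * truncated_uniform_log2(x, hi_bits)
            + alpha * truncated_uniform_log2(x, lo_bits))
```

With `alpha = 0` the blend reproduces the higher-bit quantizer exactly, and with `alpha = 1` the lower-bit one, so sweeping `alpha` gives the model a smooth path to the extreme-low-bit regime instead of an abrupt precision drop.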
[1]Cai ZW, Vasconcelos N, 2018. Cascade R-CNN: delving into high quality object detection. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6154-6162.
[2]Chen J, Zhang HY, Gong MM, et al., 2024. Collaborative compensative Transformer network for salient object detection. Patt Recogn, 154:110600.
[3]Deng J, Dong W, Socher R, et al., 2009. ImageNet: a large-scale hierarchical image database. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.248-255.
[4]Ding YF, Qin HT, Yan QH, et al., 2022. Towards accurate post-training quantization for vision Transformer. Proc 30th ACM Int Conf on Multimedia, p.5380-5388.
[5]Dosovitskiy A, Beyer L, Kolesnikov A, et al., 2021. An image is worth 16×16 words: Transformers for image recognition at scale.
[6]Gao LN, Liu B, Fu P, et al., 2024. TSVT: token sparsification vision Transformer for robust RGB-D salient object detection. Patt Recogn, 148:110190.
[7]He XY, Lu Y, Liu H, et al., 2025. ORQ-ViT: outlier resilient post training quantization for vision Transformers via outlier decomposition. J Syst Architect, 168:103530.
[8]Jiang RQ, Zhang Y, Wang LG, et al., 2025. AIQViT: architecture-informed post-training quantization for vision Transformers.
[9]Jiang YF, Sun N, Xie XS, et al., 2025. ADFQ-ViT: activation-distribution-friendly post-training quantization for vision Transformers. Neur Netw, 186:107289.
[10]Kim HJ, Shin JW, Del Barrio AA, 2022. CTMQ: cyclic training of convolutional neural networks with multiple quantization steps.
[11]Kingma DP, Ba J, 2017. Adam: a method for stochastic optimization.
[12]Li MH, Halstead M, McCool C, 2024. Knowledge distillation for efficient instance semantic segmentation with Transformers. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops, p.5432-5439.
[13]Li RD, Wang Y, Liang F, et al., 2019. Fully quantized network for object detection. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.2805-2814.
[14]Li YH, Gong RH, Tan X, et al., 2021. BRECQ: pushing the limit of post-training quantization by block reconstruction.
[15]Li ZK, Xiao JR, Yang LW, et al., 2023. RepQ-ViT: scale reparameterization for post-training quantization of vision Transformers. Proc IEEE/CVF Int Conf on Computer Vision, p.17181-17190.
[16]Lin TY, Maire M, Belongie S, et al., 2014. Microsoft COCO: common objects in context. Proc 13th European Conf on Computer Vision, p.740-755.
[17]Lin Y, Zhang TY, Sun PQ, et al., 2023. FQ-ViT: post-training quantization for fully quantized vision Transformer.
[18]Liu JW, Niu L, Yuan ZH, et al., 2023. PD-Quant: post-training quantization based on prediction difference metric. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.24427-24437.
[19]Liu SY, Liu ZC, Cheng KT, 2023. Oscillation-free quantization for low-bit vision Transformers. Proc 40th Int Conf on Machine Learning, p.21813-21824.
[20]Liu Z, Lin YT, Cao Y, et al., 2021. Swin Transformer: hierarchical vision Transformer using shifted windows. Proc IEEE/CVF Int Conf on Computer Vision, p.9992-10002.
[21]Ma YX, Li HX, Zheng XW, et al., 2024. Outlier-aware slicing for post-training quantization in vision Transformer. Proc 41st Int Conf on Machine Learning, p.33811-33825.
[22]Mahmood T, Wahid A, Hong JS, et al., 2024. A novel convolution Transformer-based network for histopathology-image classification using adaptive convolution and dynamic attention. Eng Appl Artif Intell, 135:108824.
[23]Mehta S, Rastegari M, 2022. MobileViT: light-weight, general-purpose, and mobile-friendly vision Transformer.
[24]Nagel M, Fournarakis M, Bondarenko Y, et al., 2022. Overcoming oscillations in quantization-aware training. Proc 39th Int Conf on Machine Learning, p.16318-16330.
[25]Selvaraju RR, Cogswell M, Das A, et al., 2020. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis, 128:336-359.
[26]Tai YS, Wu AYA, 2025. AMP-ViT: optimizing vision Transformer efficiency with adaptive mixed-precision post-training quantization. Proc IEEE/CVF Winter Conf on Applications of Computer Vision, p.6828-6837.
[27]Tai YS, Lin MG, Wu AYA, 2023. TSPTQ-ViT: two-scaled post-training quantization for vision Transformer. Proc IEEE Int Conf on Acoustics, Speech and Signal Processing, p.1-5.
[28]Tian YC, Han JH, Chen HT, et al., 2024. Instruct-IPT: all-in-one image processing Transformer via weight modulation.
[29]Touvron H, Cord M, Douze M, et al., 2021. Training data-efficient image Transformers & distillation through attention. Proc 38th Int Conf on Machine Learning, p.10347-10357.
[30]Wei XY, Gong RH, Li YH, et al., 2023. QDrop: randomly dropping quantization for extremely low-bit post-training quantization.
[31]Xia ZR, Dai L, Chen ZH, et al., 2025. Multi-stage feature aggregation Transformer for image rain and haze joint removal. Eng Appl Artif Intell, 149:110490.
[32]Yuan ZH, Xue CH, Chen YQ, et al., 2022. PTQ4ViT: post-training quantization for vision Transformers with twin uniform quantization. Proc 17th European Conf on Computer Vision, p.191-207.
[33]Zamir SW, Arora A, Khan S, et al., 2022. Restormer: efficient Transformer for high-resolution image restoration. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5718-5729.
[34]Zhang JN, Peng HW, Wu K, et al., 2022. MiniViT: compressing vision Transformers with weight multiplexing. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.12135-12144.
[35]Zhang ZC, Chen ZD, Wang YX, et al., 2024. A vision Transformer for fine-grained classification by reducing noise and enhancing discriminative information. Patt Recogn, 145:109979.
[36]Zheng X, Luo YH, Zhou PY, et al., 2025. Distilling efficient vision Transformers from CNNs for semantic segmentation. Patt Recogn, 158:111029.
[37]Zhong YS, Lin MB, Li XC, et al., 2022. Dynamic dual trainable bounds for ultra-low precision super-resolution networks. Proc 17th European Conf on Computer Vision, p.1-18.
[38]Zhong YS, Hu JW, Lin MB, et al., 2026. I&S-ViT: an inclusive & stable method for post-training ViTs quantization. IEEE Trans Patt Anal Mach Intell, 48(2):1063-1080.
[39]Zhou XC, Ding R, Wang YX, et al., 2023. Cellular binary neural network for accurate image classification and semantic segmentation. IEEE Trans Multim, 25:8064-8075.