
CLC number: TP391.4
On-line Access: 2026-03-02
Received: 2025-10-13
Revision Accepted: 2026-01-15
Crosschecked: 2026-03-02
Xichuan ZHOU, Sihuan ZHAO, Rui DING, Jiayu SHI, Jing NIE, Lihui CHEN, Haijun LIU. TP-ViT: truncated uniform-log2 quantizer and progressive bit-decline reconstruction for vision Transformer quantization[J]. Journal of Zhejiang University Science C, 2026, 27(1): 1-12.
@article{title="TP-ViT: truncated uniform-log2 quantizer and progressive bit-decline reconstruction for vision Transformer quantization",
author="Xichuan ZHOU, Sihuan ZHAO, Rui DING, Jiayu SHI, Jing NIE, Lihui CHEN, Haijun LIU",
journal="Journal of Zhejiang University Science C",
volume="27",
number="1",
pages="1-12",
year="2026",
publisher="Zhejiang University Press & Springer",
issn="1869-1951",
doi="10.1631/ENG.ITEE.2025.0081"
}
Abstract: Vision Transformers (ViTs) have achieved remarkable success across various artificial intelligence-based computer vision applications. However, their demanding computational and memory requirements pose significant challenges for deployment on resource-constrained edge devices. Although post-training quantization (PTQ) provides a promising solution by reducing model precision with minimal calibration data, aggressive low-bit quantization typically leads to substantial performance degradation. To address this challenge, we present the truncated uniform-log2 quantizer and progressive bit-decline reconstruction method for vision Transformer quantization (TP-ViT), an innovative PTQ framework specifically designed for ViTs. It features two key technical contributions: (1) the truncated uniform-log2 quantizer, a novel quantization approach that effectively handles outlier values in post-Softmax activations, significantly reducing quantization errors; (2) the bit-decline optimization strategy, which employs transition weights to gradually reduce bit precision while maintaining model performance under extreme quantization conditions. Comprehensive experiments on image classification, object detection, and instance segmentation tasks demonstrate TP-ViT's superior performance compared to state-of-the-art PTQ methods, particularly in challenging 3-bit quantization scenarios. Our framework achieves a notable improvement of 6.18 percentage points in top-1 accuracy for ViT-small under 3-bit quantization. These results validate TP-ViT's robustness and general applicability, paving the way for more efficient deployment of ViT models in computer vision applications on edge hardware.
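The abstract's two ideas can be illustrated with a small NumPy sketch. This is not the paper's exact formulation: the function names, the truncation threshold `tau`, and the linear blending of two bit-widths by a transition weight `alpha` are all illustrative assumptions. The sketch only shows the general pattern: log2 levels for the many near-zero post-Softmax values, uniform levels for the few large outliers, and a weighted transition from a higher to a lower bit-width.

```python
import numpy as np

def log2_quantize(x, bits, eps=1e-8):
    # Power-of-two quantization for values in (0, 1]: round -log2(x)
    # to an integer exponent in [0, 2**bits - 1], then map back to 2**-q.
    levels = 2 ** bits - 1
    q = np.clip(np.round(-np.log2(np.maximum(x, eps))), 0, levels)
    return 2.0 ** (-q)

def uniform_quantize(x, bits, lo, hi):
    # Plain uniform quantization of x over the clipped range [lo, hi].
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    q = np.round((np.clip(x, lo, hi) - lo) / scale)
    return q * scale + lo

def truncated_uniform_log2(x, bits, tau=0.25):
    # Illustrative truncated uniform-log2 quantizer (tau is a hypothetical
    # threshold): log2 levels for the many small post-Softmax values below
    # tau, uniform levels for the few large "outlier" values in [tau, 1].
    small = x < tau
    out = np.empty_like(x)
    out[small] = log2_quantize(x[small], bits)
    out[~small] = uniform_quantize(x[~small], bits, tau, 1.0)
    return out

def bit_decline(x, hi_bits, lo_bits, alpha):
    # Illustrative progressive bit decline: blend higher-bit and lower-bit
    # quantizations with a transition weight alpha that is swept from
    # 0 (all high-bit) to 1 (all low-bit) during reconstruction.
    return ((1 - alpha) * truncated_uniform_log2(x, hi_bits)
            + alpha * truncated_uniform_log2(x, lo_bits))
```

With `alpha = 0` the blend reproduces the higher-bit quantizer exactly, and with `alpha = 1` the lower-bit one, so sweeping `alpha` gives the model a smooth path to the extreme-low-bit regime instead of an abrupt precision drop.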
[1]Cai ZW, Vasconcelos N, 2018. Cascade R-CNN: delving into high quality object detection. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6154-6162.
[2]Chen J, Zhang HY, Gong MM, et al., 2024. Collaborative compensative Transformer network for salient object detection. Patt Recogn, 154:110600.
[3]Deng J, Dong W, Socher R, et al., 2009. ImageNet: a large-scale hierarchical image database. Proc IEEE Conf on Computer Vision and Pattern Recognition, p.248-255.
[4]Ding YF, Qin HT, Yan QH, et al., 2022. Towards accurate post-training quantization for vision Transformer. Proc 30th ACM Int Conf on Multimedia, p.5380-5388.
[5]Dosovitskiy A, Beyer L, Kolesnikov A, et al., 2021. An image is worth 16×16 words: Transformers for image recognition at scale.
[6]Gao LN, Liu B, Fu P, et al., 2024. TSVT: token sparsification vision Transformer for robust RGB-D salient object detection. Patt Recogn, 148:110190.
[7]He XY, Lu Y, Liu H, et al., 2025. ORQ-ViT: outlier resilient post training quantization for vision Transformers via outlier decomposition. J Syst Architect, 168:103530.
[8]Jiang RQ, Zhang Y, Wang LG, et al., 2025. AIQViT: architecture-informed post-training quantization for vision Transformers.
[9]Jiang YF, Sun N, Xie XS, et al., 2025. ADFQ-ViT: activation-distribution-friendly post-training quantization for vision Transformers. Neur Netw, 186:107289.
[10]Kim HJ, Shin JW, Del Barrio AA, 2022. CTMQ: cyclic training of convolutional neural networks with multiple quantization steps.
[11]Kingma DP, Ba J, 2017. Adam: a method for stochastic optimization.
[12]Li MH, Halstead M, McCool C, 2024. Knowledge distillation for efficient instance semantic segmentation with Transformers. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops, p.5432-5439.
[13]Li RD, Wang Y, Liang F, et al., 2019. Fully quantized network for object detection. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.2805-2814.
[14]Li YH, Gong RH, Tan X, et al., 2021. BRECQ: pushing the limit of post-training quantization by block reconstruction.
[15]Li ZK, Xiao JR, Yang LW, et al., 2023. RepQ-ViT: scale reparameterization for post-training quantization of vision Transformers. Proc IEEE/CVF Int Conf on Computer Vision, p.17181-17190.
[16]Lin TY, Maire M, Belongie S, et al., 2014. Microsoft COCO: common objects in context. Proc 13th European Conf on Computer Vision, p.740-755.
[17]Lin Y, Zhang TY, Sun PQ, et al., 2023. FQ-ViT: post-training quantization for fully quantized vision Transformer.
[18]Liu JW, Niu L, Yuan ZH, et al., 2023. PD-Quant: post-training quantization based on prediction difference metric. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.24427-24437.
[19]Liu SY, Liu ZC, Cheng KT, 2023. Oscillation-free quantization for low-bit vision Transformers. Proc 40th Int Conf on Machine Learning, p.21813-21824.
[20]Liu Z, Lin YT, Cao Y, et al., 2021. Swin Transformer: hierarchical vision Transformer using shifted windows. Proc IEEE/CVF Int Conf on Computer Vision, p.9992-10002.
[21]Ma YX, Li HX, Zheng XW, et al., 2024. Outlier-aware slicing for post-training quantization in vision Transformer. Proc 41st Int Conf on Machine Learning, p.33811-33825.
[22]Mahmood T, Wahid A, Hong JS, et al., 2024. A novel convolution Transformer-based network for histopathology-image classification using adaptive convolution and dynamic attention. Eng Appl Artif Intell, 135:108824.
[23]Mehta S, Rastegari M, 2022. MobileViT: light-weight, general-purpose, and mobile-friendly vision Transformer.
[24]Nagel M, Fournarakis M, Bondarenko Y, et al., 2022. Overcoming oscillations in quantization-aware training. Proc 39th Int Conf on Machine Learning, p.16318-16330.
[25]Selvaraju RR, Cogswell M, Das A, et al., 2020. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis, 128:336-359.
[26]Tai YS, Wu AYA, 2025. AMP-ViT: optimizing vision Transformer efficiency with adaptive mixed-precision post-training quantization. Proc IEEE/CVF Winter Conf on Applications of Computer Vision, p.6828-6837.
[27]Tai YS, Lin MG, Wu AYA, 2023. TSPTQ-ViT: two-scaled post-training quantization for vision Transformer. Proc IEEE Int Conf on Acoustics, Speech and Signal Processing, p.1-5.
[28]Tian YC, Han JH, Chen HT, et al., 2024. Instruct-IPT: all-in-one image processing Transformer via weight modulation.
[29]Touvron H, Cord M, Douze M, et al., 2021. Training data-efficient image Transformers & distillation through attention. Proc 38th Int Conf on Machine Learning, p.10347-10357.
[30]Wei XY, Gong RH, Li YH, et al., 2023. QDrop: randomly dropping quantization for extremely low-bit post-training quantization.
[31]Xia ZR, Dai L, Chen ZH, et al., 2025. Multi-stage feature aggregation Transformer for image rain and haze joint removal. Eng Appl Artif Intell, 149:110490.
[32]Yuan ZH, Xue CH, Chen YQ, et al., 2022. PTQ4ViT: post-training quantization for vision Transformers with twin uniform quantization. Proc 17th European Conf on Computer Vision, p.191-207.
[33]Zamir SW, Arora A, Khan S, et al., 2022. Restormer: efficient Transformer for high-resolution image restoration. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5718-5729.
[34]Zhang JN, Peng HW, Wu K, et al., 2022. MiniViT: compressing vision Transformers with weight multiplexing. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.12135-12144.
[35]Zhang ZC, Chen ZD, Wang YX, et al., 2024. A vision Transformer for fine-grained classification by reducing noise and enhancing discriminative information. Patt Recogn, 145:109979.
[36]Zheng X, Luo YH, Zhou PY, et al., 2025. Distilling efficient vision Transformers from CNNs for semantic segmentation. Patt Recogn, 158:111029.
[37]Zhong YS, Lin MB, Li XC, et al., 2022. Dynamic dual trainable bounds for ultra-low precision super-resolution networks. Proc 17th European Conf on Computer Vision, p.1-18.
[38]Zhong YS, Hu JW, Lin MB, et al., 2026. I&S-ViT: an inclusive & stable method for post-training ViTs quantization. IEEE Trans Patt Anal Mach Intell, 48(2):1063-1080.
[39]Zhou XC, Ding R, Wang YX, et al., 2023. Cellular binary neural network for accurate image classification and semantic segmentation. IEEE Trans Multim, 25:8064-8075.