JZUS - Journal of Zhejiang University SCIENCE

ENGINEERING Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation

Author(s): Li WEIGANG, Pedro Carvalho BROM
Affiliation(s): Department of Computer Science, University of Brasilia, Brasilia 70919-900, Brazil; Department of Mathematics, Federal Institute of Brasilia, Brasilia 71200-020, Brazil
Corresponding email(s): weigang@unb.br, pedro.brom@ifb.edu.br
Key Words: Back-translation; Chinese natural language processing; Large language model-based back-translation (LLM-BT); Paradox of poetic intent; Quasi-self-awareness; Verbatim back-translation


Li WEIGANG, Pedro Carvalho BROM. Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2500298

@article{title="Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation",
author="Li WEIGANG, Pedro Carvalho BROM",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.2500298"
}

%0 Journal Article
%T Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation
%A Li WEIGANG
%A Pedro Carvalho BROM
%J Frontiers of Information Technology & Electronic Engineering
%P 2176-2203
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R https://doi.org/10.1631/FITEE.2500298

TY - JOUR
T1 - Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation
A1 - Li WEIGANG
A1 - Pedro Carvalho BROM
JO - Frontiers of Information Technology & Electronic Engineering
SP - 2176
EP - 2203
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - https://doi.org/10.1631/FITEE.2500298
ER -


Abstract: Large language models (LLMs) excel in multilingual translation tasks, yet often struggle with culturally and semantically rich Chinese texts. This study introduces the framework of back-translation (BT) powered by LLMs, or LLM-BT, to evaluate Chinese → intermediate language → Chinese translation quality across five LLMs and three traditional systems. We construct a diverse corpus containing scientific abstracts, historical paradoxes, and literary metaphors, reflecting the complexity of Chinese at the lexical and semantic levels. Using our modular NLPMetrics system, including bilingual evaluation understudy (BLEU), character F-score (CHRF), translation edit rate (TER), and semantic similarity (SS), we find that LLMs outperform traditional tools in cultural and literary tasks. However, the results of this study uncover a high-dimensional behavioral phenomenon, the paradox of poetic intent, where surface fluency is preserved, but metaphorical or emotional depth is lost. Additionally, some models exhibit verbatim BT, suggesting a form of data-driven quasi-self-awareness, particularly under repeated or cross-model evaluation. To address BLEU’s limitations for Chinese, we propose a Jieba-segmentation BLEU variant that incorporates word-frequency and n-gram weighting, improving sensitivity to lexical segmentation and term consistency. Supplementary tests show that in certain semantic dimensions, LLM outputs approach the fidelity of human poetic translations, despite lacking a deeper metaphorical intent. Overall, this study reframes traditional fidelity vs. fluency evaluation into a richer, multi-layered analysis of LLM behavior, offering a transparent framework that contributes to explainable artificial intelligence and identifies new research pathways in cultural natural language processing and multilingual LLM alignment.
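The round-trip evaluation described in the abstract (Chinese → intermediate language → Chinese, followed by a surface metric) can be illustrated with a minimal sketch. This is not the authors' NLPMetrics system: `char_ngrams`, `chrf`, and `round_trip_score` are hypothetical names, the score is a simplified character n-gram F-score in the spirit of CHRF, and the two toy "translators" are lookup tables standing in for a real LLM or translation tool.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def chrf(reference, hypothesis, max_n=6, beta=2.0):
    """Simplified chrF-style score (an assumption, not the official metric):
    average n-gram precision and recall over the orders that exist in both
    strings, then combine them into an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        ref, hyp = char_ngrams(reference, n), char_ngrams(hypothesis, n)
        if not ref or not hyp:
            continue  # string too short for this order
        overlap = sum((ref & hyp).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def round_trip_score(source_zh, translate_out, translate_back):
    """Run Chinese -> intermediate -> Chinese and score the result
    against the original Chinese source."""
    intermediate = translate_out(source_zh)
    back = translate_back(intermediate)
    return back, chrf(source_zh, back)


# Toy stand-ins for real translation systems (hypothetical data).
to_english = {"床前明月光": "Bright moonlight before my bed"}
to_chinese = {"Bright moonlight before my bed": "床前明亮的月光"}

back, score = round_trip_score("床前明月光", to_english.get, to_chinese.get)
print(back, round(score, 3))
```

A verbatim back-translation, where the model returns the source unchanged, scores 1.0 under this sketch, which is why a high surface score alone cannot separate faithful translation from memorized output.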

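Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation

Li WEIGANG¹ (李伟钢), Pedro Carvalho BROM²
¹Department of Computer Science, University of Brasilia, Brasilia 70919-900, Brazil
²Department of Mathematics, Federal Institute of Brasilia, Brasilia 71200-020, Brazil
Abstract: Large language models (LLMs) perform remarkably well in multilingual translation tasks but face challenges with Chinese text that is culturally rich and semantically complex. This paper proposes an LLM-based back-translation (LLM-BT) framework that evaluates translation quality through a "Chinese → intermediate language → Chinese" pipeline. The study covers five mainstream LLMs and three traditional translation tools and builds a diverse corpus of scientific abstracts, historical paradoxes, and literary metaphors to reflect the lexical and semantic complexity of Chinese. An NLPMetrics evaluation system is constructed, comprising bilingual evaluation understudy (BLEU), character F-score (CHRF), translation edit rate (TER), and semantic similarity (SS). Experimental results show that LLMs generally outperform traditional tools on literary tasks. They also reveal a high-dimensional behavioral phenomenon, the paradox of poetic intent: models tend to preserve surface fluency while weakening metaphorical and emotional depth. In addition, some models show a tendency toward verbatim back-translation, exhibiting a data-driven "quasi-self-awareness" under repeated or cross-model testing. To address the limitations of BLEU in evaluating Chinese, this paper proposes an improved BLEU that combines Jieba segmentation with word-frequency weighting, which effectively increases sensitivity to word segmentation and terminological consistency. Supplementary experiments show that, in some semantic dimensions, LLM output approaches the fidelity of human poetry translation but still lacks deeper metaphorical expression. This paper extends the traditional "fidelity-fluency" evaluation into a multi-dimensional analysis of LLM behavior, provides a transparent framework that advances explainable artificial intelligence, and points to new research directions in cultural natural language processing and multilingual LLM alignment.

Key words: Back-translation; Chinese natural language processing; Large language model-based back-translation (LLM-BT); Paradox of poetic intent; Quasi-self-awareness; Verbatim back-translation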


