Journal of Zhejiang University

ENGINEERING Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

Dynamic prompting class distribution optimization for semi-supervised sound event detection

Author(s): Lijian GAO, Qing ZHU, Yaxin SHEN, Qirong MAO, Yongzhao ZHAN
Affiliation(s): School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212016, China
Corresponding email(s): ljgao@ujs.edu.cn, mao_qr@ujs.edu.cn
Key Words: Prompt tuning; Class distribution learning; Semi-supervised learning; Sound event detection


Lijian GAO, Qing ZHU, Yaxin SHEN, Qirong MAO, Yongzhao ZHAN. Dynamic prompting class distribution optimization for semi-supervised sound event detection[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400061

@article{FITEE.2400061,
title="Dynamic prompting class distribution optimization for semi-supervised sound event detection",
author="Lijian GAO and Qing ZHU and Yaxin SHEN and Qirong MAO and Yongzhao ZHAN",
journal="Frontiers of Information Technology \& Electronic Engineering",
year="in press",
publisher="Zhejiang University Press \& Springer",
doi="https://doi.org/10.1631/FITEE.2400061"
}

%0 Journal Article
%T Dynamic prompting class distribution optimization for semi-supervised sound event detection
%A Lijian GAO
%A Qing ZHU
%A Yaxin SHEN
%A Qirong MAO
%A Yongzhao ZHAN
%J Frontiers of Information Technology & Electronic Engineering
%P 556-567
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2400061

TY - JOUR
T1 - Dynamic prompting class distribution optimization for semi-supervised sound event detection
A1 - Lijian GAO
A1 - Qing ZHU
A1 - Yaxin SHEN
A1 - Qirong MAO
A1 - Yongzhao ZHAN
JO - Frontiers of Information Technology & Electronic Engineering
SP - 556
EP - 567
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2400061
ER -


Abstract: Semi-supervised sound event detection (SSED) tasks typically leverage a large amount of unlabeled and synthetic data to facilitate model generalization during training, reducing overfitting on a limited set of labeled data. However, the generalization training process often encounters challenges from noisy interference introduced by pseudo-labels or domain knowledge gaps. To alleviate noisy interference in class distribution learning, we propose an efficient semi-supervised class distribution learning method through dynamic prompt tuning, named prompting class distribution optimization (PADO). Specifically, when modeling real labeled data, PADO dynamically incorporates independent learnable prompt tokens to explore prior knowledge about the true distribution. Then, the prior knowledge serves as prompt information, dynamically interacting with the posterior noisy-class distribution information. In this case, PADO achieves class distribution optimization while maintaining model generalization, leading to a significant improvement in the efficiency of class distribution learning. Compared with state-of-the-art methods on the SSED datasets from DCASE 2019, 2020, and 2021 challenges, PADO achieves significant performance improvements. Furthermore, it is readily extendable to other benchmark models.
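The abstract describes the mechanism only at a high level, so the following is a minimal sketch under stated assumptions rather than PADO's actual formulation. It assumes one learnable prompt token per class, a simple dot-product attention of each token over frame embeddings, and a fixed blending weight between the prompt-derived prior and the noisy posterior. The names `prompt_refine`, `alpha`, and the blending rule are hypothetical and do not come from the paper.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prompt_refine(frames, prompts, noisy_posterior, alpha=0.5):
    """Blend a prompt-derived class prior with a noisy class posterior.

    frames:          T x D frame embeddings of one audio clip
    prompts:         C x D class prompt tokens (assumed learned on real labeled data)
    noisy_posterior: C class probabilities, e.g. derived from pseudo-labels
    alpha:           hypothetical blending weight in [0, 1]
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    refined = []
    for c, token in enumerate(prompts):
        # Each class token attends over the frames of the clip.
        weights = softmax([dot(token, f) * scale for f in frames])
        # Attention-weighted summary of the clip for this class.
        context = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
        # Prior score for class c: sigmoid of the token-context similarity.
        prior = 1.0 / (1.0 + math.exp(-dot(token, context) * scale))
        # Illustrative interaction: convex combination of prior and noisy posterior.
        refined.append(alpha * prior + (1.0 - alpha) * noisy_posterior[c])
    return refined


# Toy usage with random embeddings (4 frames, 3 dimensions, 2 classes).
random.seed(0)
toy_frames = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
toy_prompts = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
print(prompt_refine(toy_frames, toy_prompts, [0.9, 0.2]))
```

In this sketch, setting `alpha` to 0 returns the noisy posterior unchanged, while larger values let the prompt-derived prior pull the class distribution away from the pseudo-label estimate. In a trained system the tokens and the interaction would be learned jointly rather than fixed by hand.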

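A semi-supervised sound event detection method based on dynamic prompting class distribution optimization

Lijian GAO¹, Qing ZHU¹, Yaxin SHEN¹, Qirong MAO¹,², Yongzhao ZHAN¹
¹School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212016, China
²Jiangsu Engineering Research Center of Big Data Ubiquitous Perception and Intelligent Agriculture Applications, Zhenjiang 212016, China
Abstract: Semi-supervised sound event detection tasks typically use large-scale unlabeled data and synthetic data to improve model generalization, thereby reducing overfitting on a small amount of labeled data. However, generalization training is usually accompanied by interference from pseudo-label noise and domain knowledge gaps. To alleviate the interference of noise with class distribution learning, a semi-supervised class distribution learning method based on dynamic prompt optimization (PADO) is proposed. Specifically, given real labeled data, PADO dynamically embeds a set of learnable independent parameters (class tokens) to mine prior knowledge of the true distribution. This knowledge serves as additional prompt information and interacts dynamically with the noisy posterior distribution knowledge, thereby optimizing class distribution knowledge while preserving model generalization. On this basis, PADO can significantly improve the efficiency of class distribution learning. Experimental results on the DCASE 2019, 2020, and 2021 datasets show that PADO clearly outperforms current state-of-the-art methods and transfers readily to other mainstream models.

Key words: Prompt tuning; Class distribution learning; Semi-supervised learning; Sound event detection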

