ANALYSIS OF NEURAL NETWORK ARCHITECTURES FOR SPEAKER RECOGNITION
DOI: https://doi.org/10.17308/sait/1995-5499/2025/3/88-100
Keywords: deep neural network, speaker recognition, voice deep features, biometry
Abstract
The paper studies state-of-the-art speaker recognition methods based on neural network modeling. The work focuses on the secondary parameterization of the speech signal, which is performed before processing by the neural network. The material is relevant because new application areas are emerging in which voice is the most appropriate biometric key; to successfully build an access control system, reliable information about state-of-the-art solutions is therefore required. The aim of the work is to study and analyze speaker recognition methods that use various architectural solutions (convolutional neural networks and language models) to extract the speaker's unique voice characteristics. The presented evaluation of the methods is based on the Equal Error Rate (EER) metric, the point at which the errors of the first and second kind intersect. This metric makes it possible to assess the distribution of the hidden speaker representations. The analysis is carried out on two English-language test datasets, VoxCeleb-1 and Common Voice 19, which correspond to the conditions under which speaker recognition can occur. The analysis on the test datasets showed that the hidden internal spaces of the models are shifted towards the maximum value of the error of the first or second kind. The estimated shift is determined using the threshold value by which the decision about speaker similarity is made. Directions for further research are proposed that would enable high-quality speaker recognition by voice in a multilingual domain. The paper also presents the results of an additional analysis of the considered neural network models in a new language domain, Russian.
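The EER metric described above can be computed directly from verification scores: it is the operating point where the false rejection rate (error of the first kind) equals the false acceptance rate (error of the second kind). The sketch below is illustrative only; the score distributions and the threshold sweep are assumptions for demonstration, not data or code from the paper.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep a decision threshold over all observed scores and return the
    EER: the point where the false rejection rate (genuine pairs rejected)
    and false acceptance rate (impostor pairs accepted) are closest."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer, best_t = np.inf, None, None
    for t in thresholds:
        frr = np.mean(genuine_scores < t)    # error of the first kind
        far = np.mean(impostor_scores >= t)  # error of the second kind
        if abs(frr - far) < best_gap:
            best_gap, eer, best_t = abs(frr - far), (frr + far) / 2, t
    return eer, best_t

# Toy cosine-similarity scores for same-speaker and different-speaker
# pairs (synthetic, for illustration only).
rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 1000)
impostor = rng.normal(0.3, 0.1, 1000)
eer, thr = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.3f} at threshold {thr:.3f}")
```

A shift of the hidden representation space, as observed in the paper, would show up here as the EER threshold drifting away from the midpoint of the two score distributions.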
References
Sorokin V. N., Tsyplikhin A. I. Speaker verification based on spectral-temporal parameters of the speech signal // Information Technologies in Technical and Socio-Economic Systems. – 2010. – Vol. 10, № 2. – P. 87–104.
Kabir M. M. [et al.] A Survey of Speaker Recognition: Fundamental Theories, Recognition Methods and Opportunities // IEEE Access. – 2021. – Vol. 9. – P. 79236–79263.
Zeinali H. [et al.] BUT System Description to VoxCeleb Speaker Recognition Challenge 2019: arXiv:1910.12592. arXiv, 2019.
Kaye D. H. The error of equal error rates // Law, Probability and Risk. – 2002. – Vol. 1, № 1. – P. 3–8.
Wang H. [et al.] CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking: arXiv:2303.00332. arXiv, 2023.
Desplanques B., Thienpondt J., Demuynck K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification // Interspeech 2020. – 2020. – P. 3830–3834.
Chen S. [et al.] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing // IEEE J. Sel. Top. Signal Process. – 2022. – Vol. 16, № 6. – P. 1505–1518.
Baevski A. [et al.] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations: arXiv:2006.11477. arXiv, 2020.
Hsu W.-N. [et al.] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units: arXiv:2106.07447. arXiv, 2021.
Yakovlev I. [et al.] Reshape Dimensions Network for Speaker Recognition // Interspeech 2024. – 2024. – P. 3235–3239.
Nagrani A., Chung J. S., Zisserman A. VoxCeleb: a large-scale speaker identification dataset // Interspeech 2017. – 2017. – P. 2616–2620.
Mozilla Common Voice. – Available at: https://commonvoice.mozilla.org/ (accessed: 08.01.2024).
Wan L. [et al.] Generalized End-to-End Loss for Speaker Verification: arXiv:1710.10467. arXiv, 2020.
Jung J. [et al.] D-vector based speaker verification system using Raw Waveform CNN // Proceedings of the 2017 International Seminar on Artificial Intelligence, Networking and Information Technology (ANIT 2017). Bangkok, Thailand: Atlantis Press, 2018.
Peddinti V., Povey D., Khudanpur S. A time delay neural network architecture for efficient modeling of long temporal contexts // Interspeech 2015. – ISCA, 2015. – P. 3214–3218.
Snyder D. [et al.] X-Vectors: Robust DNN Embeddings for Speaker Recognition // 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, AB: IEEE, 2018. – P. 5329–5333.
Chung J. S., Nagrani A., Zisserman A. VoxCeleb2: Deep Speaker Recognition // Interspeech 2018. – 2018. – P. 1086–1090.
Crochiere R. E., Rabiner L. R. Multirate digital signal processing // Signal Processing. – 1983. – Vol. 5, № 5. – P. 469–470.
He K. [et al.] Deep Residual Learning for Image Recognition: arXiv:1512.03385. arXiv, 2015.
Yu Y.-Q. [et al.] Cam: Context-Aware Masking for Robust Speaker Verification // ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE, 2021. – P. 6703–6707.
Haykin S. Neural Networks: A Comprehensive Foundation. – Moscow : Williams Publishing House, 2006. – 1104 p.
Vaswani A. [et al.] Attention Is All You Need: arXiv:1706.03762. arXiv, 2023.
facebookresearch/libri-light: dataset for lightly supervised training using the LibriVox (https://librivox.org/) audio book recordings. – Available at: https://github.com/facebookresearch/libri-light (accessed: 02.02.2025).
Chen G. [et al.] GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio // Interspeech 2021. ISCA, 2021. – P. 3670–3674.
Wang C. [et al.] VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation: arXiv:2101.00390. arXiv, 2021.
Hendrycks D., Gimpel K. Gaussian Error Linear Units (GELUs): arXiv:1606.08415. arXiv, 2023.
Cai W., Chen J., Li M. Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System: arXiv:1804.05160. arXiv, 2018.
Lin Y. [et al.] VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark: arXiv:2407.11510. arXiv, 2024.
Ciresan D., Meier U., Schmidhuber J. Multi-column deep neural networks for image classification // 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI: IEEE, 2012. – P. 3642–3649.
Golubinsky A. N. A method for the analytical calculation of parameters of mathematical models of the speech signal built on the basis of modulation theory // Control Systems and Information Technologies. – 2009. – № 1.3. – P. 332–336.
Golubinsky A. N. A method for estimating the fundamental frequency of a speech signal based on the minimum residual of correlation coefficients // Telecommunications. – 2009. – Vol. 8. – P. 16–21.
Sorokin V. N. Speech Processes: monograph. – Moscow : Narodnoe Obrazovanie, 2012. – 599 p.