Evaluating the accuracy of a speaker-specific approach to synthesized voice detection

Mikhail Vitalyevich Evsyukov; Mikhail Mikhailovich Putyato; Alexander Samvelovich Makaryan; Alexander Nikolaevich Cherkasov

doi:10.17308/sait/1995-5499/2024/1/77-93

Authors

Mikhail V. Evsyukov Kuban State Technological University https://orcid.org/0000-0001-7101-6251
Mikhail M. Putyato Kuban State Technological University https://orcid.org/0000-0003-0414-6034
Alexander S. Makaryan Kuban State Technological University https://orcid.org/0000-0002-1801-6137
Alexander N. Cherkasov Kuban State Technological University https://orcid.org/0000-0002-5015-4556

Keywords:

spoofing, presentation attack, biometrics, synthesized voice, voice authentication, speaker recognition, Gaussian mixture model, LFCC

Abstract

Modern speaker recognition systems achieve high accuracy on bona fide human voices. However, their primary weakness is vulnerability to spoofing attacks. Spoofing detection is currently dominated by speaker-independent systems, although some studies show that a speaker-specific approach to spoofing detection is promising. Nevertheless, the effectiveness of speaker-specific systems for detecting logical access (LA) spoofing attacks has not previously been studied. The purpose of this research is to compare the accuracy of speaker-specific and speaker-independent versions of the same LA spoofing detection system. In addition, we evaluate how the accuracy of LA spoofing detection is affected by the training method used to create speaker-specific models and by the amount of speaker-specific training data available. The experiments were conducted on the ASVspoof 2019 LA dataset using an LFCC-GMM spoofing detection system, and accuracy was measured as the equal error rate (EER). We found that using speaker-specific models of bona fide speech significantly improved spoofing detection accuracy without any change to the feature extraction algorithms or machine learning models. Increasing the amount of data used to create the speaker-specific models also proved to be an effective way to improve detection accuracy. We consider the optimal configuration to be speaker-specific models of bona fide speech combined with speaker-independent models of spoofed speech. With a speaker-specific training set of 90 recordings, this approach reduced the EER from 16.86% to 9.71%.
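The abstract describes an LFCC-GMM detector whose bona fide model can be speaker-specific while the spoof model stays speaker-independent, with accuracy reported as EER. The Python sketch below illustrates that scoring scheme under stated assumptions: the GMMs use diagonal covariances and are already trained (training, whether from scratch or by adaptation, is not shown), LFCC frames are given as plain lists of floats, and the score is the average per-frame log-likelihood ratio commonly used by LFCC-GMM baselines. `DiagGMM`, `llr_score`, `compute_eer` and the speaker ID `"spk1"` are hypothetical names, not the authors' code.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[float]  # one LFCC feature vector (assumed precomputed)


@dataclass
class DiagGMM:
    """Diagonal-covariance GMM; parameters are assumed to be already trained."""
    weights: List[float]
    means: List[List[float]]
    variances: List[List[float]]

    def frame_loglik(self, x: Frame) -> float:
        # log p(x) = logsumexp_k [log w_k + log N(x; mu_k, diag(var_k))]
        terms = []
        for w, mu, var in zip(self.weights, self.means, self.variances):
            ll = math.log(w)
            for xi, m, v in zip(x, mu, var):
                ll -= 0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            terms.append(ll)
        top = max(terms)  # subtract the max for numerical stability
        return top + math.log(sum(math.exp(t - top) for t in terms))

    def avg_loglik(self, frames: Sequence[Frame]) -> float:
        # Average over frames so utterances of different length are comparable.
        return sum(self.frame_loglik(f) for f in frames) / len(frames)


def llr_score(frames: Sequence[Frame], bonafide_models: Dict[str, DiagGMM],
              spoof_model: DiagGMM, speaker_id: str) -> float:
    """Speaker-specific bona fide model vs. speaker-independent spoof model.
    A higher score means the utterance looks more like bona fide speech."""
    bonafide = bonafide_models[speaker_id]
    return bonafide.avg_loglik(frames) - spoof_model.avg_loglik(frames)


def compute_eer(bonafide_scores: Sequence[float],
                spoof_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (EER, threshold), trying every observed score as a threshold."""
    best_gap, eer, best_thr = float("inf"), 1.0, 0.0
    for thr in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: bona fide utterances scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed utterances scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        if abs(frr - far) < best_gap:
            best_gap, eer, best_thr = abs(frr - far), (frr + far) / 2.0, thr
    return eer, best_thr


if __name__ == "__main__":
    # Toy 1-D models (hypothetical values, not trained on ASVspoof data).
    models = {"spk1": DiagGMM([0.5, 0.5], [[-1.0], [1.0]], [[1.0], [1.0]])}
    spoof = DiagGMM([1.0], [[4.0]], [[1.0]])
    print(llr_score([[0.2], [-0.5]], models, spoof, "spk1"))
    print(compute_eer([2.0, 3.0, 4.0], [0.0, 1.0]))
```

Replacing the per-speaker dictionary with a single pooled bona fide model gives the speaker-independent variant, so both versions can be compared with the same `compute_eer` call. The EER here is the mean of the false rejection and false acceptance rates at the observed threshold where they are closest, a simple approximation rather than an interpolated ROC estimate.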

Author Biographies

Mikhail V. Evsyukov, Kuban State Technological University

Postgraduate student, Department of Cybersecurity and Information Protection, Kuban State Technological University

Mikhail M. Putyato, Kuban State Technological University

Associate Professor, Department of Cybersecurity and Information Protection, Kuban State Technological University

Alexander S. Makaryan, Kuban State Technological University

PhD, Associate Professor, Head of the Department of Cybersecurity and Information Protection, Kuban State Technological University

Alexander N. Cherkasov, Kuban State Technological University

PhD, Associate Professor, Head of the Research Center for Computer Technologies, Control Systems and Integrated Security, Kuban State Technological University