MULTIMODAL GENERATION OF SPEECH, FACIAL EXPRESSIONS, AND GESTURES IN DIGITAL AVATARS: CURRENT METHODS AND PROSPECTS

Alexandr Alexandrovich Axyonov

doi:10.17308/sait/1995-5499/2025/4/155-182

Alexandr Alexandrovich Axyonov, St. Petersburg Federal Research Center of the Russian Academy of Sciences, https://orcid.org/0000-0002-7479-2851

Keywords: digital avatars, multimodal generation, speech and articulation synchronization, facial expressions, gestures and full-body motion, audiovisual data corpora, neural network models, transformers, diffusion models, hybrid architectures

Abstract

This article presents a systematic review of current approaches to generating speech, facial expressions, and gestures for digital avatars. It focuses on the data corpora used to train and test neural network models. The review covers facial and portrait image and video datasets, emotional audiovisual resources, speech and gesture corpora recorded in the wild, multi-view recordings that use NeRF representations, full-body and motion capture (MoCap) datasets, and synthetic and commercial datasets used to compensate for the shortage of real recordings. Key limitations of existing resources are identified: the staged nature of most studio corpora, the narrow range of emotions and cultural characteristics covered, and the gap between real and synthetic data. The lack of a universal multimodal corpus is shown to hinder objective comparison of methods and the development of unified benchmarks. The review also covers methods for synthesizing the motion of digital avatars. Traditional algorithms based on phoneme-to-viseme mappings and morph targets are compared with modern neural network architectures: recurrent and convolutional models, transformers, diffusion models, and hybrid approaches. Their advantages and limitations are analyzed for speech synchronization, articulation, facial expression reproduction, and full-body motion generation. The article emphasizes the need for combined multimodal corpora and unified evaluation standards.

Author Biography

Alexandr Alexandrovich Axyonov, St. Petersburg Federal Research Center of the Russian Academy of Sciences

Candidate of Technical Sciences, Senior Researcher, Laboratory of Speech and Multimodal Interfaces
