MULTIMODAL GENERATION OF SPEECH, FACIAL EXPRESSIONS, AND GESTURES IN DIGITAL AVATARS: CURRENT METHODS AND FUTURE PROSPECTS

Authors

DOI:

https://doi.org/10.17308/sait/1995-5499/2025/4/155-182

Keywords:

digital avatars, multimodal generation, speech-articulation synchronization, facial expressions, gestures and full-body motion, audiovisual corpora, neural network models, transformers, diffusion models, hybrid architectures

Abstract

This article presents a systematic review of contemporary approaches to the generation of speech, facial expressions, and gestures in digital avatars. Particular attention is devoted to the analysis of datasets that serve as the foundation for training and evaluating neural network models. The review encompasses facial and portrait image and video datasets, audiovisual resources annotated with emotional states, speech and gesture corpora recorded in naturalistic settings, multiview recordings employing NeRF-based representations, full-body and motion capture (MoCap) datasets, as well as synthetic and commercial datasets designed to compensate for the scarcity of real-world recordings. Key limitations of existing resources are identified, including the staged nature of most studio-based corpora, the restricted diversity of emotional and cultural features, and the gap between real and synthetic data. It is demonstrated that the lack of a universal multimodal corpus hinders both objective benchmarking and the development of unified evaluation protocols. The review further examines methods for synthesizing avatar movements, contrasting traditional algorithms, based on phoneme–viseme mappings and morph targets, with modern neural architectures, including recurrent and convolutional models, transformers, diffusion-based approaches, and hybrid frameworks. Their advantages and limitations are analyzed in the context of speech synchronization, articulation, facial expression reproduction, and full-body motion generation. The necessity of developing integrated multimodal corpora and standardized evaluation benchmarks is underscored.
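As a concrete illustration of the traditional pipeline the abstract contrasts with neural methods, the sketch below converts a timed phoneme sequence into a viseme track via a many-to-one lookup table. The phoneme labels, viseme names, and timings are illustrative assumptions only, not drawn from any specific system surveyed in the review.

```python
# Minimal sketch of a traditional phoneme-to-viseme lip-sync step.
# The phoneme set and viseme labels below are illustrative assumptions.

# Many-to-one phoneme -> viseme mapping (illustrative subset):
# several phonemes share one mouth shape, e.g. the bilabials p/b/m.
PHONEME_TO_VISEME = {
    "p": "PP", "b": "PP", "m": "PP",
    "f": "FF", "v": "FF",
    "aa": "AA", "ae": "AA",
    "iy": "EE", "ih": "EE",
    "sil": "REST",
}

def viseme_track(phonemes):
    """Convert a timed phoneme sequence [(start, end, phoneme), ...] into
    a timed viseme sequence, merging consecutive phonemes that map to the
    same viseme into a single segment."""
    track = []
    for start, end, ph in phonemes:
        vis = PHONEME_TO_VISEME.get(ph, "REST")
        if track and track[-1][2] == vis:
            # Same mouth shape as the previous segment: extend it.
            track[-1] = (track[-1][0], end, vis)
        else:
            track.append((start, end, vis))
    return track

# Timed phonemes for a hypothetical utterance: (start_s, end_s, phoneme).
phones = [(0.00, 0.08, "sil"), (0.08, 0.15, "m"), (0.15, 0.30, "aa"),
          (0.30, 0.38, "p"), (0.38, 0.45, "b"), (0.45, 0.60, "iy")]
print(viseme_track(phones))
```

In a full traditional system each viseme segment would then drive morph-target (blend-shape) weights, with coarticulation handled by blending adjacent visemes over time.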

Author Biography

  • Alexandr A. Axyonov, St. Petersburg Federal Research Center of the Russian Academy of Sciences

    PhD, Senior Researcher of the Speech and Multimodal Interfaces Laboratory

References

Lu Y. Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation / Y. Lu, J. Chai, X. Cao // ACM Transactions on Graphics (TOG). – 2021. – Vol. 40 (6). – P. 1–17. doi: 10.1145/3478513.3480484.

Korzun V. A. Generation of Facial Expressions for Virtual Assistants // Proceedings of the Moscow Institute of Physics and Technology. – 2022. – Vol. 14, No. 3 (55). – P. 57–62. (in Russian)

Niu L. Audio2AB: Audio-Driven Collaborative Generation of Virtual Character Animation / L. Niu, W. Xie, D. Wang, Z. Cao, X. Liu // Virtual Reality & Intelligent Hardware. – 2024. – Vol. 6(1). – P. 56–70. doi: 10.1016/j.vrih.2023.08.006.

Drobyshev N. EMOPortraits: Emotion-Enhanced Multimodal One-Shot Head Avatars / N. Drobyshev, A. B. Casademunt, K. Vougioukas, Z. Landgraf, S. Petridis, M. Pantic // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2024. – P. 8498–8507. doi: 10.1109/CVPR52733.2024.00812.

Xia Y. GMTalker: Gaussian Mixture-Based Audio-Driven Emotional Talking Video Portraits / Y. Xia, L. Wang, X. Deng, X. Luo, Y. Liu // arXiv preprint arXiv:2312.07669. – 2023.

Zhang J. Virbo: Multimodal Multilingual Avatar Video Generation in Digital Marketing / J. Zhang, J. Chen, C. Wang, Z. Yu, T. Qi, C. Liu, D. Wu // arXiv preprint arXiv:2403.11700. – 2024. doi: 10.48550/arXiv.2403.11700.

Kolotouros N. Instant 3D Human Avatar Generation Using Image Diffusion Models / N. Kolotouros, T. Alldieck, E. Corona, E. G. Bazavan, C. Sminchisescu // Proceedings of the European Conference on Computer Vision (ECCV). – Cham: Springer Nature Switzerland, 2024. – P. 177–195. doi: 10.48550/arXiv.2406.07516

Axyonov A. NeRF-LipSync: A Diffusion Model for Speech-Driven and View-Consistent Lip Synchronization in Digital Avatars / A. Axyonov, M. Dolgushin, D. Ryumin // The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences. – 2025. – Vol. 48. – P. 25–31. doi: 10.5194/isprs-archives-XLVIII-2-2025-25-2025.

Axyonov A. A. A Method for Generating Digital Avatar Animation with Speech and Non-Verbal Synchronization Based on Bimodal Data / A. A. Axyonov, E. V. Ryumina, D. A. Ryumin // Scientific and Technical Journal of Information Technologies, Mechanics and Optics. – 2025. – Vol. 25, No. 4. – P. 651–662. doi: 10.17586/2226-1494-2025-25-4-651-662. (in Russian)

Yan Y. DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation / Y. Yan, Z. Zhou, Z. Wang, J. Gao, X. Yang // Visual Intelligence. – 2024. – Vol. 2(1). – P. 24. doi: 10.48550/arXiv.2203.07931.

Lyubakhinets A. A. Integration of Realistic AI Avatars into Websites to Improve User Interaction // Universum: Technical Sciences. – 2024. – Vol. 1, No. 12 (129). – P. 47–55. (in Russian)

Hong P. Real-Time Speech-Driven Face Animation with Expressions Using Neural Networks / P. Hong, Z. Wen, T. S. Huang // IEEE Transactions on Neural Networks. – 2002. – Vol. 13 (4). – P. 916–927. doi: 10.1109/TNN.2002.1021892.

Rafiei Oskooei A. Seeing the Sound: Multilingual Lip Sync for Real-Time Face-to-Face Translation / A. Rafiei Oskooei, M. S. Aktaş, M. Keleş // Computers. – 2024. – Vol. 14 (1). – P. 7. doi: 10.3390/computers14010007.

Zhen R. Research on the Application of Virtual Human Synthesis Technology in Human-Computer Interaction / R. Zhen, W. Song, J. Cao // Proceedings of the IEEE/ACIS 22nd International Conference on Computer and Information Science (ICIS). – 2022. – P. 199–204. doi: 10.1109/ICIS54925.2022.9882355

Ravichandran S. Synthesizing Photorealistic Virtual Humans Through Cross-Modal Disentanglement / S. Ravichandran, O. Texler, D. Dinev, H. J. Kang // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2023. – P. 4585–4594.

Cheng W. RITA: A Real-Time Interactive Talking Avatars Framework / W. Cheng, C. Wan, Y. Cao, S. Chen // arXiv preprint arXiv:2406.13093. – 2024. doi: 10.48550/arXiv.2406.13093.

Bozkurt E. Personalized Speech-Driven Expressive 3D Facial Animation Synthesis with Style Control / E. Bozkurt // arXiv preprint arXiv:2310.17011. – 2023. doi: 10.48550/arXiv.2310.17011.

Wu H. Speech-Driven 3D Face Animation with Composite and Regional Facial Movements / H. Wu, S. Zhou, J. Jia, J. Xing, Q. Wen, X. Wen // Proceedings of the 31st ACM International Conference on Multimedia. – 2023. – P. 6822–6830. doi: 10.1145/3581783.3611775

Karras T. A Style-Based Generator Architecture for Generative Adversarial Networks / T. Karras, S. Laine, T. Aila // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2019. – P. 4401–4410. doi: 10.1109/CVPR.2019.00453.

Karras T. Progressive Growing of GANs for Improved Quality, Stability, and Variation / T. Karras, T. Aila, S. Laine, J. Lehtinen // arXiv preprint arXiv:1710.10196. – 2017.

Zhu H. CelebV-HQ: A Large-Scale Video Facial Attributes Dataset / H. Zhu, W. Wu, W. Zhu, L. Jiang, S. Tang, L. Zhang, C. C. Loy // Proceedings of the European Conference on Computer Vision (ECCV). – Cham: Springer Nature Switzerland, 2022. – P. 650–667. doi: 10.48550/arXiv.2207.12393

Xie L. VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution / L. Xie, X. Wang, H. Zhang, C. Dong, Y. Shan // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2022. – P. 657–666. doi: 10.1109/CVPRW56347.2022.00081.

Nagrani A. VoxCeleb: A Large-Scale Speaker Identification Dataset / A. Nagrani, J. S. Chung, A. Zisserman // Proceedings of Interspeech. – 2017. doi: 10.21437/Interspeech.2017-950.

Chung J. S. VoxCeleb2: Deep Speaker Recognition / J. S. Chung, A. Nagrani, A. Zisserman // Proceedings of Interspeech. – 2018. doi: 10.21437/Interspeech.2018-1929.

Zhang Z. Flow-Guided One-Shot Talking Face Generation with a High-Resolution Audio-Visual Dataset / Z. Zhang, L. Li, Y. Ding // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2021. – P. 3660–3669. doi: 10.1109/CVPR46437.2021.00366.

Wang K. MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation / K. Wang, Q. Wu, L. Song, Z. Yang, W. Wu, C. Qian, R. He, Y. Qiao, C. C. Loy // Proceedings of the European Conference on Computer Vision (ECCV). – 2020. doi: 10.1007/978-3-030-58589-1_42.

Livingstone S. R. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English / S. R. Livingstone, F. A. Russo // PLoS ONE. – 2018. – Vol. 13. – e0196391. doi: 10.1371/journal.pone.0196391.

Cao H. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset / H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, R. Verma // IEEE Transactions on Affective Computing. – 2014. – Vol. 5 (4). – P. 377–390. doi: 10.1109/TAFFC.2014.2336244.

Busso C. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database / C. Busso, M. Bulut, C. Lee, E. Kazemzadeh, E. M. Provost, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan // Language Resources and Evaluation. – 2008. – Vol. 42. – P. 335–359. doi: 10.1007/s10579-008-9076-6.

Martin O. The eNTERFACE’05 Audio-Visual Emotion Database / O. Martin, I. Kotsia, B. M. Macq, I. Pitas // Proceedings of the ICDE Workshops. – 2006.

Burkhardt F. A Database of German Emotional Speech / F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, B. Weiss // Proceedings of Interspeech. – 2005. doi: 10.21437/Interspeech.2005-446.

Siarohin A. Motion Representations for Articulated Animation / A. Siarohin, O. J. Woodford, J. Ren, M. Chai, S. Tulyakov // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2021. – P. 13648–13657. doi: 10.1109/CVPR46437.2021.01344.

Jafarian Y. Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos / Y. Jafarian, H. S. Park // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2021. – P. 12748–12757. doi: 10.1109/CVPR46437.2021.01256.

Zhuang Y. IDOL: Instant Photorealistic 3D Human Creation from a Single Image / Y. Zhuang, J. Lv, H. Wen, Q. Shuai, A. Zeng, H. Zhu, S. Chen, Y. Yang, X. Cao, W. Liu // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2024. – P. 26308–26319. doi: 10.1109/CVPR52734.2025.02450.

Huang Z. WildAvatar: Learning In-the-Wild 3D Avatars from the Web / Z. Huang, S. Hu, G. Wang, T. Liu, Y. Zang, Z. Cao, W. Li, Z. Liu // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2024. – P. 15963–15975. doi: 10.1109/CVPR52734.2025.01488.

Gafni G. Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction / G. Gafni, J. Thies, M. Zollhöfer, M. Nießner // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2020. – P. 8645–8654. doi: 10.1109/CVPR46437.2021.00854.

Kirschstein T. NeRSemble: Multi-View Radiance Field Reconstruction of Human Heads / T. Kirschstein, S. Qian, S. Giebenhain, T. Walter, M. Nießner // ACM Transactions on Graphics (TOG). – 2023. – Vol. 42. – P. 1–14. doi: 10.1145/3592455.

Li X. Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture / X. Li, Y. Cheng, X. Ren, H. Jia, D. Xu, W. Zhu, Y. Yan // Proceedings of the European Conference on Computer Vision (ECCV). – 2024. doi: 10.48550/arXiv.2406.00440.

Jiang W. NeuMan: Neural Human Radiance Field from a Single Video / W. Jiang, K. M. Yi, G. Samei, O. Tuzel, A. Ranjan // Proceedings of the European Conference on Computer Vision (ECCV). – Cham: Springer Nature Switzerland, 2022. – P. 402–418. doi: 10.48550/arXiv.2203.12575.

Yu T. Function4D: Real-Time Human Volumetric Capture from Very Sparse Consumer RGBD Sensors / T. Yu, Z. Zheng, K. Guo, P. Liu, Q. Dai, Y. Liu // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2021. – P. 5742–5752. doi: 10.1109/CVPR46437.2021.00569.

Zheng Z. Structured Local Radiance Fields for Human Avatar Modeling / Z. Zheng, H. Huang, T. Yu, H. Zhang, Y. Guo, Y. Liu // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2022. – P. 15872–15882. doi: 10.1109/CVPR52688.2022.01543.

Zheng Y. I M Avatar: Implicit Morphable Head Avatars from Videos / Y. Zheng, V. F. Abrevaya, X. Chen, M. C. Buhler, M. J. Black, O. Hilliges // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2021. – P. 13535–13545. doi: 10.1109/CVPR52688.2022.01318.

Zheng Y. PointAvatar: Deformable Point-Based Head Avatars from Videos / Y. Zheng, Y. Wang, G. Wetzstein, M. J. Black, O. Hilliges // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2022. – P. 21057–21067. doi: 10.1109/CVPR52729.2023.02017.

Jiang T. InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds / T. Jiang, X. Chen, J. Song, O. Hilliges // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2022. – P. 16922–16932. doi: 10.1109/CVPR52729.2023.01623.

Shen K. X-Avatar: Expressive Human Avatars / K. Shen, C. Guo, M. Kaufmann, J. J. Zarate, J. Valentin, J. Song, O. Hilliges // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2023. – P. 16911–16921. doi: 10.1109/CVPR52729.2023.01622.

Zheng Z. AvatarReX: Real-Time Expressive Full-Body Avatars / Z. Zheng, X. Zhao, H. Zhang, B. Liu, Y. Liu // ACM Transactions on Graphics (TOG). – 2023. – Vol. 42. – P. 1–19. doi: 10.1145/3592101.

Peng S. Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans / S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, X. Zhou // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2020. – P. 9050–9059. doi: 10.1109/CVPR46437.2021.00894.

Alldieck T. Video-Based Reconstruction of 3D People Models / T. Alldieck, M. A. Magnor, W. Xu, C. Theobalt, G. Pons-Moll // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2018. – P. 8387–8397. doi: 10.1109/CVPR.2018.00875.

Mahmood N. AMASS: Archive of Motion Capture as Surface Shapes / N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, M. J. Black // Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). – 2019. – P. 5442–5451. doi: 10.1109/ICCV.2019.00554.

Li R. AI Choreographer: Music Conditioned 3D Dance Generation with AIST++ / R. Li, S. Yang, D. A. Ross, A. Kanazawa // Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). – 2021. – P. 13381–13392. doi: 10.1109/ICCV48922.2021.01315.

Lv X. HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects / X. Lv, L. Xu, Y. Yan, X. Jin, C. Xu, S. Wu, Y. Liu, L. Li, M. Bi, W. Zeng, X. Yang // Proceedings of the European Conference on Computer Vision (ECCV). – 2024. doi: 10.48550/arXiv.2407.12371.

Wood E. Fake It Till You Make It: Face Analysis in the Wild Using Synthetic Data Alone / E. Wood, T. Baltrušaitis, C. Hewitt, S. Dziadzio, M. Johnson, V. Estellers, T. J. Cashman, J. Shotton // Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). – 2021. – P. 3661–3671. doi: 10.1109/ICCV48922.2021.00366.

RenderPeople. – Available at: https://renderpeople.com/ (accessed: 15.09.2025).

Parke F. I. Computer Facial Animation / F. I. Parke, K. Waters. – Wellesley, MA: A K Peters, 1996.

Ostermann J. Animation of Synthetic Faces in MPEG-4 / J. Ostermann // Proceedings of Computer Animation ’98 (Cat. No.98EX169). – 1998. – P. 49–55. doi: 10.1109/CA.1998.681907.

Osipa J. Stop Staring: Facial Modeling and Animation Done Right / J. Osipa. – Indianapolis, IN: Wiley, 2003.

Cohen M. M. Modeling Coarticulation in Synthetic Visual Speech / M. M. Cohen, D. W. Massaro // Proceedings of the International Conference on Computer Animation. – 1993. doi: 10.1007/978-4-431-66911-1_13.

Massaro D. W. Perceiving Talking Faces: From Speech Perception to a Behavioral Principle / D. W. Massaro. – Cambridge, MA: MIT Press, 1999. doi: 10.2307/1423641.

Setyati E. Phoneme-Viseme Mapping for Indonesian Language Based on Blend Shape Animation / E. Setyati, S. Sumpeno, M. H. Purnomo, K. Mikami, M. Kakimoto // Proceedings of the International Conference on Information Technology and Electrical Engineering (ICITEE). – 2015.

Williams L. Performance-Driven Facial Animation / L. Williams // Proceedings of the 17th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH). – 1990. – P. 235–242. doi: 10.1145/97879.97906.

Habibie I. Imitator: Personalized Speech-Driven 3D Facial Animation / I. Habibie, S. Aliakbarian, D. P. Cosker, C. Theobalt, J. Thies // Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). – 2023. – P. 20564–20574. doi: 10.1109/ICCV51070.2023.01885.

Zhen R. Human-Computer Interaction System: A Survey of Talking-Head Generation / R. Zhen, W. Song, Q. He, J. Cao, L. Shi, J. Luo // Electronics. – 2023. – Vol. 12 (1). – P. 218. doi: 10.3390/electronics12010218.

Song W. TalkingStyle: Personalized Speech-Driven 3D Facial Animation with Style Preservation / W. Song, X. Wang, S. Zheng, S. Li, A. Hao, X. Hou // IEEE Transactions on Visualization and Computer Graphics. – 2024. – Vol. 31. – P. 4682–4694. doi: 10.1109/TVCG.2024.3409568.

Prajwal K. R. A Lip Sync Expert Is All You Need for Speech-to-Lip Generation in the Wild / K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, C. V. Jawahar // Proceedings of the 28th ACM International Conference on Multimedia. – 2020. – P. 484–492. doi: 10.1145/3394171.3413532.

Thambiraja B. 3DiFACE: Diffusion-Based Speech-Driven 3D Facial Animation and Editing / B. Thambiraja, S. Aliakbarian, D. Cosker, J. Thies // arXiv preprint arXiv:2312.00870. – 2023. doi: 10.48550/arXiv.2312.00870.

Bengio Y. Learning Long-Term Dependencies with Gradient Descent Is Difficult / Y. Bengio, P. Y. Simard, P. Frasconi // IEEE Transactions on Neural Networks. – 1994. – Vol. 5 (2). – P. 157–166. doi: 10.1109/72.279181.

Hochreiter S. Long Short-Term Memory / S. Hochreiter, J. Schmidhuber // Neural Computation. – 1997. – Vol. 9. – P. 1735–1780. doi: 10.1162/neco.1997.9.8.1735.

Cho K. Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation / K. Cho, B. Van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio // Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). – 2014. – P. 1724–1734. doi: 10.3115/v1/D14-1179.

Kumar R. ObamaNet: Photo-Realistic Lip-Sync from Text / R. Kumar, J. M. Sotelo, K. Kumar, A. D. Brébisson, Y. Bengio // arXiv preprint arXiv:1801.01442. – 2017.

Karras T. Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion / T. Karras, T. Aila, S. Laine, A. Herva, J. Lehtinen // ACM Transactions on Graphics (TOG). – 2017. – Vol. 36. – P. 1–12. doi: 10.1145/3072959.3073658.

Aneja D. Real-Time Lip Sync for Live 2D Animation / D. Aneja, W. Li // arXiv preprint arXiv:1910.08685. – 2019.

Fan B. Photo-Real Talking Head with Deep Bidirectional LSTM / B. Fan, L. Wang, F. K. Soong, L. Xie // Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). – 2015. – P. 4884–4888. doi: 10.1109/ICASSP.2015.7178899.

Huang J. ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving / J. Huang, X. Dong, W. Song, H. Li, J. Zhou, Y. Cheng, S. Liao, L. Chen, Y. Yan, X. Liang // arXiv preprint arXiv:2404.16771. – 2024. doi: 10.48550/arXiv.2404.16771.

Wei H. AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation / H. Wei, Z. Yang, Z. Wang // arXiv preprint arXiv:2403.17694. – 2024. doi: 10.48550/arXiv.2403.17694.

Vougioukas K. Realistic Speech-Driven Facial Animation with GANs / K. Vougioukas, S. Petridis, M. Pantic // International Journal of Computer Vision. – 2019. – Vol. 128. – P. 1398–1413. doi: 10.1007/s11263-019-01251-8.

Chai Y. Speech-Driven Facial Animation with Spectral Gathering and Temporal Attention / Y. Chai, Y. Weng, L. Wang, K. Zhou // Frontiers of Computer Science. – 2021. – Vol. 16. – Article 166306. doi: 10.1007/s11704-020-0133-7.

Zhuang Y. Learn2Talk: 3D Talking Face Learns from 2D Talking Face / Y. Zhuang, B. Cheng, Y. Cheng, Y. Jin, R. Liu, C. Li, X. Cheng, J. Liao, J. Lin // IEEE Transactions on Visualization and Computer Graphics. – 2024. – Vol. 31. – P. 5829–5841. doi: 10.1109/TVCG.2024.3476275.

Peng Z. SyncTalk: The Devil Is in the Synchronization for Talking Head Synthesis / Z. Peng, W. Hu, Y. Shi, X. Zhu, X. Zhang, H. Zhao, J. He, H. Liu, Z. Fan // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2023. – P. 666–676. doi: 10.1109/CVPR52733.2024.00070.

Bai Z. Efficient 3D Implicit Head Avatar with Mesh-Anchored Hash Table Blendshapes / Z. Bai, F. Tan, S. Fanello, R. Pandey, M. Dou, S. Liu, P. Tan, Y. Zhang // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2024. – P. 1975–1984. doi: 10.1109/CVPR52733.2024.00193.

Liu L. Neural Actor / L. Liu, M. Habermann, V. Rudnev, K. Sarkar, J. Gu, C. Theobalt // ACM Transactions on Graphics (TOG). – 2021. – Vol. 40. – P. 1–16.

Sun X. VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior / X. Sun, L. Zhang, H. Zhu, P. Zhang, B. Zhang, X. Ji, K. Zhou, D. Gao, L. Bo, X. Cao // Proceedings of the International Conference on 3D Vision (3DV). – 2023. – P. 713–722. doi: 10.1109/3DV66043.2025.00071.

Gao Y. High-Fidelity and Freely Controllable Talking Head Video Generation / Y. Gao, Y. Zhou, J. Wang, X. Li, X. Ming, Y. Lu // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2023. – P. 5609–5619. doi: 10.1109/CVPR52729.2023.00543.

Xu S. VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time / S. Xu, G. Chen, Y. Guo, J. Yang, C. Li, Z. Zang, Y. Zhang, X. Tong, B. Guo // Advances in Neural Information Processing Systems. – 2024. – Vol. 37. – P. 660–684. doi: 10.48550/arXiv.2404.10667.

Wang Y. InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation / Y. Wang, J. Guo, J. Bai, R. Yu, T. He, X. Tan, X. Sun, J. Bian // arXiv preprint arXiv:2405.15758. – 2024. doi: 10.48550/arXiv.2405.15758.

Xu Z. MagicAnimate: Temporally Consistent Human Image Animation Using Diffusion Model / Z. Xu, J. Zhang, J. H. Liew, H. Yan, J. Liu, C. Zhang, J. Feng, M. Z. Shou // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2023. – P. 1481–1490. doi: 10.1109/CVPR52733.2024.00147.

Tian L. EMO: Emote Portrait Alive – Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions / L. Tian, Q. Wang, B. Zhang, L. Bo // Proceedings of the European Conference on Computer Vision (ECCV). – 2024. doi: 10.48550/arXiv.2402.17485.

Ye Z. Real3D-Portrait: One-Shot Realistic 3D Talking Portrait Synthesis / Z. Ye, T. Zhong, Y. Ren, J. Yang, W. Li, J. Huang, Z. Jiang, J. He, R. Huang, J. Liu, C. Zhang, X. Yin, Z. Ma, Z. Zhao // arXiv preprint arXiv:2401.08503. – 2024. doi: 10.48550/arXiv.2401.08503.

Shao R. Human4DiT: Free-View Human Video Generation with 4D Diffusion Transformer / R. Shao, Y. Pang, Z. Zheng, J. Sun, Y. Liu // arXiv preprint arXiv:2405.17405. – 2024. doi: 10.48550/arXiv.2405.17405.

Corona E. VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis / E. Corona, A. Zanfir, E. G. Bazavan, N. Kolotouros, T. Alldieck, C. Sminchisescu // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2024. – P. 15896–15908. doi: 10.1109/CVPR52734.2025.01482.

Hu L. GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians / L. Hu, H. Zhang, Y. Zhang, B. Zhou, B. Liu, S. Zhang, L. Nie // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). – 2024. – P. 634–644. doi: 10.1109/CVPR52733.2024.00067.

Published

2025-12-11

Issue

Section

Intelligent Information Systems, Data Analysis and Machine Learning

How to Cite

MULTIMODAL GENERATION OF SPEECH, FACIAL EXPRESSIONS, AND GESTURES IN DIGITAL AVATARS: CURRENT METHODS AND FUTURE PROSPECTS. (2025). Proceedings of Voronezh State University. Series: Systems Analysis and Information Technologies, 4, 155-182. https://doi.org/10.17308/sait/1995-5499/2025/4/155-182