ON THE CRITERIA FOR SELECTING VISION TRANSFORMER MODELS FOR DEPLOYMENT ON RESOURCE-CONSTRAINED DEVICES

Authors

DOI:

https://doi.org/10.17308/sait/1995-5499/2025/4/198-218

Keywords:

transformers, convolutional networks, image classification, ImageNet, edge devices, model resource intensity

Abstract

The advancement of intelligent data analysis tools and their widespread implementation necessitate the development of procedures for improving the efficiency of neural network model execution on end devices. This paper proposes criteria for selecting neural network models for subsequent execution on devices with limited computing resources, such as edge devices. In addition to network accuracy and size, the set of criteria includes the depth of the model and the total number of parameters, weights, and activations of its largest layer, which determine the latency and memory requirements on the end device. The compiled set of criteria allowed us to consider several approaches to comparing and selecting models, including forming a Pareto frontier and ranking with the TOPSIS method under various significance coefficients. Using the ImageNet image classification task, we demonstrate a comparative evaluation of high-accuracy small-scale models based on transformer and convolutional architectures. Various configurations were considered, differing in how the input image is encoded and how features are processed in the internal representations of the network. The analysis allowed us to select models with high classification accuracy (0.81 Acc): EVA-02 Ti and RepViT M1.1. The selected models are balanced in terms of network depth and maximum layer size, which is significant for small models. The presented results demonstrate the potential for flexibly applying the criteria to select models for a specific device and for identifying bottlenecks for subsequent model modification to improve resource utilization.
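
The two-stage selection scheme described in the abstract can be made concrete with a minimal numerical sketch. The Python/NumPy code below is illustrative only: the model names (model_a through model_d), the criteria values, and the significance coefficients are invented placeholders, not measurements or code from the paper. It first discards Pareto-dominated models (accuracy treated as a benefit criterion; parameter count, depth, and largest-layer size as cost criteria) and then ranks the survivors with the standard TOPSIS procedure: vector normalization, weighting, and relative closeness to the ideal and anti-ideal points.

    import numpy as np

    def pareto_mask(X, benefit):
        # Boolean mask of non-dominated rows. X is an (m, n) criteria
        # matrix; benefit[j] is True where larger values are better.
        Y = np.where(benefit, X, -X)   # flip cost criteria to "larger is better"
        mask = np.ones(len(Y), dtype=bool)
        for i in range(len(Y)):
            others = np.delete(Y, i, axis=0)
            # Row i is dominated if some other row is >= on every
            # criterion and strictly > on at least one.
            mask[i] = not np.any(np.all(others >= Y[i], axis=1)
                                 & np.any(others > Y[i], axis=1))
        return mask

    def topsis(X, weights, benefit):
        # Relative closeness to the ideal point, in [0, 1]; larger is better.
        V = X / np.linalg.norm(X, axis=0) * weights   # normalize, then weight
        ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
        anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
        d_pos = np.linalg.norm(V - ideal, axis=1)
        d_neg = np.linalg.norm(V - anti, axis=1)
        return d_neg / (d_pos + d_neg)

    # Placeholder rows: [accuracy, params (M), depth, largest layer (MB)].
    models = np.array(["model_a", "model_b", "model_c", "model_d"])
    X = np.array([[0.81, 5.7, 12, 4.1],
                  [0.81, 8.3, 24, 2.6],
                  [0.79, 4.9, 18, 3.0],
                  [0.80, 11.0, 30, 6.2]])
    benefit = np.array([True, False, False, False])
    weights = np.array([0.4, 0.2, 0.2, 0.2])  # illustrative significance coefficients

    front = pareto_mask(X, benefit)           # model_d is dominated and drops out
    scores = topsis(X[front], weights, benefit)
    for name, score in sorted(zip(models[front], scores), key=lambda t: -t[1]):
        print(f"{name}: closeness = {score:.3f}")

Varying the weight vector reproduces the "various significance coefficients" mentioned above: shifting weight from accuracy toward the largest-layer size biases the ranking toward models that fit a tighter memory budget. Per-layer statistics of the kind assumed here can be collected with tools such as fvcore and TorchLens, cited in the references below.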

Author Biographies

  • Roman B. Rybka, National Research Centre «Kurchatov Institute»

    PhD in Engineering Sciences, Leading Researcher

  • Artem V. Gryaznov, National Research Centre «Kurchatov Institute»

    Junior Researcher

  • Ivan A. Moloshnikov, National Research Centre «Kurchatov Institute»

    Researcher

  • Maksim S. Skorokhodov, National Research Centre «Kurchatov Institute»

    Junior Researcher

  • Aleksandr G., National Research Centre «Kurchatov Institute»

    DSc in Physics and Mathematics, Senior Researcher

References

Vaswani A. [et al.] Attention is all you need // Advances in Neural Information Processing Systems. – 2017. – Vol. 30.

Svoboda F. [et al.] Deep learning on microcontrollers: A study on deployment costs and challenges // Proceedings of the 2nd European Workshop on Machine Learning and Systems. – 2022. – pp. 54–63.

Lin J. [et al.] Memory-efficient patch-based inference for tiny deep learning // Advances in Neural Information Processing Systems. – 2021. – Vol. 34. – pp. 2346–2358.

Yang J. [et al.] TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices // arXiv preprint arXiv:2311.01759. – 2023.

Deng J. [et al.] ImageNet: A large-scale hierarchical image database // 2009 IEEE Conference on Computer Vision and Pattern Recognition. – IEEE, 2009. – pp. 248–255.

Nauen T. C. [et al.] Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers // 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). – IEEE, 2025. – pp. 6955–6966.

Kim S. [et al.] Full stack optimization of transformer inference: a survey // arXiv preprint arXiv:2302.14017. – 2023.

Khan S. [et al.] Transformers in vision: A survey // ACM Computing Surveys (CSUR). – 2022. – Vol. 54, no. 10s. – pp. 1–41.

Lin T. Y. [et al.] Microsoft COCO: Common objects in context // European Conference on Computer Vision. – Cham : Springer International Publishing, 2014. – pp. 740–755.

Zhou B. [et al.] Semantic understanding of scenes through the ADE20K dataset // International Journal of Computer Vision. – 2019. – Vol. 127, no. 3. – pp. 302–321.

Cordts M. [et al.] The Cityscapes dataset for semantic urban scene understanding // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. – 2016. – pp. 3213–3223.

Kay W. [et al.] The Kinetics human action video dataset // arXiv preprint arXiv:1705.06950. – 2017.

Patro B. N., Agneeswaran V. S. Efficiency 360: Efficient vision transformers // arXiv preprint arXiv:2302.08374. – 2023.

Han K. [et al.] A survey on visual transformer // arXiv preprint arXiv:2012.12556. – 2020.

Yang Y. [et al.] Transformers meet visual learning understanding: A comprehensive review // arXiv preprint arXiv:2203.12944. – 2022.

Krizhevsky A. [et al.] Learning multiple layers of features from tiny images. – 2009.

Dendorfer P. [et al.] MOTChallenge: A benchmark for single-camera multiple target tracking // arXiv preprint arXiv:2010.07548. – 2020.

Wang Y. [et al.] Vision transformers for image classification: A comparative survey // Technologies. – 2025. – Vol. 13, no. 1. – p. 32.

Liu Y. [et al.] A survey of visual transformers // IEEE Transactions on Neural Networks and Learning Systems. – 2023. – Vol. 35, no. 6. – pp. 7478–7498.

Sun C. [et al.] Revisiting unreasonable effectiveness of data in deep learning era // Proceedings of the IEEE International Conference on Computer Vision. – 2017. – pp. 843–852.

Khalil M., Khalil A., Ngom A. A comprehensive study of vision transformers in image classification tasks // arXiv preprint arXiv:2312.01232. – 2023.

Khan A. [et al.] A survey of the vision transformers and their CNN-transformer based variants // Artificial Intelligence Review. – 2023. – Vol. 56, no. Suppl 3. – pp. 2917–2970.

Dosovitskiy A. An image is worth 16x16 words: Transformers for image recognition at scale // arXiv preprint arXiv:2010.11929. – 2020.

Steiner A. [et al.] How to train your ViT? Data, augmentation, and regularization in vision transformers // arXiv preprint arXiv:2106.10270. – 2021.

Model card for vit_tiny_patch16_384.augreg_in21k_ft_in1k. – Available at: https://huggingface.co/timm/vit_tiny_patch16_384.augreg_in21k_ft_in1k

Ali A. [et al.] XCiT: Cross-covariance image transformers // Advances in Neural Information Processing Systems. – 2021. – Vol. 34. – pp. 20014–20027.

Model card for xcit_tiny_12_p8_384.fb_dist_in1k. – Available at: https://huggingface.co/timm/xcit_tiny_12_p8_384.fb_dist_in1k

Maaz M. [et al.] EdgeNeXt: Efficiently amalgamated CNN-transformer architecture for mobile vision applications // European Conference on Computer Vision. – Cham : Springer Nature Switzerland, 2022. – pp. 3–20.

Model card for edgenext_small.usi_in1k. – Available at: https://huggingface.co/timm/edgenext_small.usi_in1k

Wu K. [et al.] TinyViT: Fast pretraining distillation for small vision transformers // European Conference on Computer Vision. – Cham : Springer Nature Switzerland, 2022. – pp. 68–85.

Model card for tiny_vit_5m_224.dist_in22k_ft_in1k. – Available at: https://huggingface.co/timm/tiny_vit_5m_224.dist_in22k_ft_in1k

Fang Y. [et al.] EVA-02: A visual representation for neon genesis // Image and Vision Computing. – 2024. – Vol. 149. – p. 105171.

Model card for eva02_tiny_patch14_336.mim_in22k_ft_in1k. – Available at: https://huggingface.co/timm/eva02_tiny_patch14_336.mim_in22k_ft_in1k

Vasu P. K. A. [et al.] FastViT: A fast hybrid vision transformer using structural reparameterization // Proceedings of the IEEE/CVF International Conference on Computer Vision. – 2023. – pp. 5785–5795.

Model card for fastvit_s12.apple_dist_in1k. – Available at: https://huggingface.co/timm/fastvit_s12.apple_dist_in1k

Wang A. [et al.] RepViT: Revisiting mobile CNN from ViT perspective // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. – 2024. – pp. 15909–15920.

Model card for repvit_m1_1.dist_450e_in1k. – Available at: https://huggingface.co/timm/repvit_m1_1.dist_450e_in1k

Woo S. [et al.] ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. – 2023. – pp. 16133–16142.

Model card for convnextv2_pico.fcmae_ft_in1k. – Available at: https://huggingface.co/timm/convnextv2_pico.fcmae_ft_in1k

Li Y. [et al.] Rethinking vision transformers for MobileNet size and speed // Proceedings of the IEEE/CVF International Conference on Computer Vision. – 2023. – pp. 16889–16900.

Model card for efficientformerv2_s1.snap_dist_in1k. – Available at: https://huggingface.co/timm/efficientformerv2_s1.snap_dist_in1k

Tan M., Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks // International Conference on Machine Learning. – PMLR, 2019. – pp. 6105–6114.

Xie Q. [et al.] Self-training with Noisy Student improves ImageNet classification // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. – 2020. – pp. 10687–10698.

Model card for tf_efficientnet_b2.ns_jft_in1k. – Available at: https://huggingface.co/timm/tf_efficientnet_b2.ns_jft_in1k

Qin D. [et al.] MobileNetV4: Universal models for the mobile ecosystem // European Conference on Computer Vision. – Cham : Springer Nature Switzerland, 2024. – pp. 78–96.

Model card for mobilenetv4_conv_medium.e500_r256_in1k. – Available at: https://huggingface.co/timm/mobilenetv4_conv_medium.e500_r256_in1k

Dollár P., Singh M., Girshick R. Fast and accurate model scaling // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. – 2021. – pp. 924–932.

Model card for regnetz_b16.ra3_in1k. – Available at: https://huggingface.co/timm/regnetz_b16.ra3_in1k

Wightman R. PyTorch Image Models. – Available at: https://github.com/huggingface/pytorch-image-models

fvcore. – Available at: https://github.com/facebookresearch/fvcore

Taylor J. M., Kriegeskorte N. Extracting and visualizing hidden activations and computational graphs of PyTorch models with TorchLens // Scientific Reports. – 2023. – Vol. 13, no. 1. – p. 14375.

Published

2025-12-11

Issue

No. 4 (2025)

Section

Intelligent Information Systems, Data Analysis and Machine Learning

How to Cite

ON THE CRITERIA FOR SELECTING VISION TRANSFORMER MODELS FOR DEPLOYMENT ON RESOURCE-CONSTRAINED DEVICES. (2025). Proceedings of Voronezh State University. Series: Systems Analysis and Information Technologies, 4, 198–218. https://doi.org/10.17308/sait/1995-5499/2025/4/198-218