INTEGRATION OF HETEROGENEOUS DATA USING ARTIFICIAL INTELLIGENCE METHODS

Authors

Oleh Zherebetskyi, Oleh Basystiuk

DOI:

https://doi.org/10.20998/2079-0023.2025.02.12

Keywords:

multimodality, artificial intelligence, emotion classification, fusion architectures, audio-video-text processing, transformers, cross-modal attention

Abstract

Modern methods of multimodal data analysis are gaining critical importance in artificial intelligence due to their ability to integrate information from diverse sources, including text, audio, sensor signals, and images. Such integration enables systems to form a richer and more context-aware understanding of complex environments, which is essential for domains such as healthcare diagnostics, adaptive education technologies, intelligent security systems, autonomous robotics, and various forms of human-computer interaction. Multimodal approaches also allow AI models to compensate for the limitations inherent in individual modalities, thereby enhancing robustness and resilience to noise or incomplete data. The study employs theoretical analysis of the scientific literature, comparative classification of multimodal architectures, systematization of fusion techniques, and formal generalization of model design principles. Additionally, attention is given to evaluating emerging paradigms powered by large-scale foundation models and transformer-based architectures. The primary methods and models for processing multimodal data are summarized, covering both classical and state-of-the-art approaches. Architectures of early (feature-level), late (decision-level), and hybrid (intermediate) fusion are described and compared in terms of flexibility, computational complexity, interpretability, and accuracy. Emerging solutions based on large multimodal transformer models, contrastive learning, and unified embedding spaces are also analyzed. Special attention is paid to cross-modal attention mechanisms that enable dynamic weighting of modalities depending on task context. The study finds that multimodal systems achieve significantly higher accuracy, stability, and semantic coherence in classification, detection, and interpretation tasks when modalities are properly synchronized and fused using adaptive strategies. These findings underscore the promise of further research toward scalable architectures capable of real-time multimodal reasoning, improved cross-modal transfer, and context-aware attention mechanisms.
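
As an illustration of the cross-modal attention and decision-level fusion discussed above, the following minimal PyTorch sketch (an assumption-laden example, not an implementation from the reviewed works: the audio-text modality pair, feature dimensions, and classifier head are chosen only for demonstration) shows text features attending to audio features before a simple fused emotion classifier.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy cross-modal attention block with a late-fusion classifier head."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 6):
        super().__init__()
        # Text queries attend over audio keys/values (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Decision-level fusion: concatenate pooled text and audio representations.
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, text_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, text_len, dim),  e.g. token embeddings
        # audio_feats: (batch, audio_len, dim), e.g. frame-level acoustic embeddings
        attended, _ = self.cross_attn(query=text_feats, key=audio_feats, value=audio_feats)
        fused_text = self.norm(text_feats + attended)          # residual connection
        pooled = torch.cat([fused_text.mean(dim=1),            # text enriched by audio
                            audio_feats.mean(dim=1)], dim=-1)  # plus pooled audio
        return self.classifier(pooled)                         # emotion logits

# Usage with random tensors standing in for pretrained encoder outputs.
model = CrossModalFusion()
logits = model(torch.randn(8, 20, 256), torch.randn(8, 50, 256))
print(logits.shape)  # torch.Size([8, 6])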

Author Biographies

Oleh Zherebetskyi, Lviv Polytechnic National University

PhD student at the Department of Artificial Intelligence Systems, Lviv Polytechnic National University, Lviv, Ukraine

Oleh Basystiuk, Lviv Polytechnic National University

Candidate of Technical Sciences (PhD), Senior Lecturer at the Department of Artificial Intelligence Systems, Lviv Polytechnic National University, Lviv, Ukraine

References

Yuan Y., Li Z., Zhao B. A Survey of Multimodal Learning: Methods, Applications, and Future. ACM Computing Surveys. 2025, vol. 57, no. 7, pp. 1–34. DOI: 10.1145/3713070.

Golovanevsky M., Schiller E., Nair A., Han E., Singh R., Eickhoff C. One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data. Biocomputing 2025: Proceedings of the Pacific Symposium. Kohala Coast, Hawaii, USA: World Scientific, 2024, pp. 580–593. DOI: 10.1142/9789819807024_0041.

Xue J., Wang Y., Tian Y., Li Y., Shi L., Wei L. Detecting fake news by exploring the consistency of multimodal data. Information Processing & Management. 2021, vol. 58, no. 5, pp. 102610–102624. DOI: 10.1016/j.ipm.2021.102610.

Xie Y., Yang L., Zhang M., Chen S., Li J. A Review of Multimodal Interaction in Remote Education: Technologies, Applications, and Challenges. Applied Sciences. 2025, vol. 15, no. 7, pp. 3937–3964. DOI: 10.3390/app15073937.

Wilson A., Wilkes S., Teramoto Y., Hale S. Multimodal analysis of disinformation and misinformation. Royal Society Open Science. 2023, vol. 10, no. 12, pp. 230964–230989. DOI: 10.1098/rsos.230964.

Alaba S. Y., Gurbuz A. C., Ball J. E. Emerging Trends in Autonomous Vehicle Perception: Multimodal Fusion for 3D Object Detection. World Electric Vehicle Journal. 2024, vol. 15, no. 1, pp. 20–30. DOI: 10.3390/wevj15010020.

Warner E., Lee J., Hsu W., Syeda-Mahmood T., Kahn C. E., Gevaert O., Rao A. Multimodal Machine Learning in Image-Based and Clinical Biomedicine: Survey and Prospects. International Journal of Computer Vision. 2024, vol. 132, no. 9, pp. 3753–3769. DOI: 10.1007/s11263-024-02032-8.

Lian H., Lu C., Li S., Zhao Y., Tang C., Zong Y. A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face. Entropy. 2023, vol. 25, no. 10, pp. 1440–1473. DOI: 10.3390/e25101440.

Khan M., Tran P.-N., Pham N. T., El Saddik A., Othmani A. MemoCMT: multimodal emotion recognition using cross-modal transformer-based feature fusion. Scientific Reports. 2025, vol. 15, no. 1, pp. 5473–5486. DOI: 10.1038/s41598-025-89202-x.

Udahemuka G., Djouani K., Kurien A. M. Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review. Applied Sciences. 2024, vol. 14, no. 17, pp. 8071–8115. DOI: 10.3390/app14178071.

Caschera M. C., Grifoni P., Ferri F. Emotion Classification from Speech and Text in Videos Using a Multimodal Approach. Multimodal Technologies and Interaction. 2022, vol. 6, no. 4, pp. 28–54. DOI: 10.3390/mti6040028.

Tsai Y. H., Bai S., Liang P. P., Kolter J. Z., Morency L. P., Salakhutdinov R. Multimodal Transformer for Unaligned Multimodal Language Sequences. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019, pp. 6558–6569. DOI: 10.18653/v1/P19-1656.

Farhadizadeh M., Weymann M., Blaß M., Kraus J., Gundler C., Walter S., Hempen N., Binder H., Binder N. A Systematic Review of Challenges and Proposed Solutions in Modeling Multimodal Data. arXiv. 2025. DOI: 10.48550/ARXIV.2505.06945.

Wu Y., Zhang S., Li P. Multi-modal emotion recognition in conversation based on prompt learning with text-audio fusion features. Scientific Reports. 2025, vol. 15, no. 1, pp. 8855–8888. DOI: 10.1038/s41598-025-89758-8.

Das A., Sarma M. S., Hoque M. M., Siddique N., Dewan M. A. A. AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition. Sensors. 2024, vol. 24, no. 18, pp. 5862–5886. DOI: 10.3390/s24185862.

Xu P., Zhu X., Clifton D. A. Multimodal Learning with Transformers: A Survey. arXiv. 2023. DOI: 10.48550/arXiv.2206.06488.

Alayrac J. B., Donahue J., Luc P., Miech A., Barr I. et al. Flamingo: a Visual Language Model for Few-Shot Learning. arXiv. 2022. DOI: 10.48550/ARXIV.2204.14198.

Sun C., Myers A., Vondrick C., Murphy K., Schmid C. VideoBERT: A Joint Model for Video and Language Representation Learning. arXiv. 2019. DOI: 10.48550/ARXIV.1904.01766.

Sun Z., Lin M., Zhu Q., Xie Q., Wang F., Lu Z., Peng Y. A scoping review on multimodal deep learning in biomedical images and texts. Journal of Biomedical Informatics. 2023, vol. 146, pp. 104482–104502. DOI: 10.1016/j.jbi.2023.104482.

Kaczmarczyk R., Wilhelm T. I., Martin R., Roos J. Evaluating multimodal AI in medical diagnostics. npj Digital Medicine. 2024, vol. 7, no. 1, pp. 205–210. DOI: 10.1038/s41746-024-01208-3.

Published

2025-12-29

How to Cite

Zherebetskyi, O., & Basystiuk, O. (2025). INTEGRATION OF HETEROGENEOUS DATA USING ARTIFICIAL INTELLIGENCE METHODS. Bulletin of National Technical University "KhPI". Series: System Analysis, Control and Information Technologies, (2 (14)), 90–95. https://doi.org/10.20998/2079-0023.2025.02.12

Issue

No. 2 (14) (2025)

Section

INFORMATION TECHNOLOGY