INTEGRATION OF HETEROGENEOUS DATA USING ARTIFICIAL INTELLIGENCE METHODS
DOI: https://doi.org/10.20998/2079-0023.2025.02.12
Keywords: multimodality, artificial intelligence, emotion classification, fusion architectures, audio-video-text processing, transformers, cross-modal attention
Abstract
Methods of multimodal data analysis are gaining critical importance in modern AI development due to their ability to integrate information from diverse sources, including text, audio, sensor signals, and images. Such integration enables systems to form a richer and more context-aware understanding of complex environments, which is essential for domains such as healthcare diagnostics, adaptive education technologies, intelligent security systems, autonomous robotics, and various forms of human-computer interaction. Multimodal approaches also enable AI models to compensate for the limitations inherent in individual modalities, thereby enhancing robustness and resilience to noise or incomplete data. The study employs theoretical analysis of scientific literature, comparative classification of multimodal architectures, systematization of fusion techniques, and formal generalization of model design principles. Additionally, attention is given to evaluating emerging paradigms powered by large-scale foundation models and transformer-based architectures. The primary methods and models for processing multimodal data are summarized, covering both classical and state-of-the-art approaches. Architectures of early (feature-level), late (decision-level), and hybrid (intermediate) fusion are described and compared in terms of flexibility, computational complexity, interpretability, and accuracy. Emerging solutions based on large multimodal transformer models, contrastive learning, and unified embedding spaces are also analyzed. Special attention is paid to cross-modal attention mechanisms that enable dynamic weighting of modalities depending on task context. The study determines that multimodal systems achieve significantly higher accuracy, stability, and semantic coherence in classification, detection, and interpretation tasks when modalities are properly synchronized and fused using adaptive strategies. These findings underscore the promise of further research toward scalable architectures capable of real-time multimodal reasoning, improved cross-modal transfer, and context-aware attention mechanisms.
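To make the three fusion strategies compared in the abstract concrete, the sketch below contrasts early (feature-level) fusion, late (decision-level) fusion, and a hybrid variant built on cross-modal attention for an audio-video-text emotion-classification setting. This is a minimal illustrative sketch in PyTorch, not the article's implementation: all class names, feature dimensions, and the toy inputs are assumptions made for demonstration.

```python
# Minimal sketch of the fusion strategies discussed in the abstract.
# Assumptions: per-modality feature vectors (or token sequences) are already
# extracted; dimensions and the 7-class emotion head are illustrative.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Feature-level fusion: concatenate per-modality features, classify jointly."""
    def __init__(self, dims=(128, 64, 32), n_classes=7):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(sum(dims), 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, text, audio, video):
        return self.classifier(torch.cat([text, audio, video], dim=-1))

class LateFusion(nn.Module):
    """Decision-level fusion: independent per-modality classifiers, averaged logits."""
    def __init__(self, dims=(128, 64, 32), n_classes=7):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d, n_classes) for d in dims])

    def forward(self, text, audio, video):
        logits = [h(x) for h, x in zip(self.heads, (text, audio, video))]
        return torch.stack(logits).mean(dim=0)

class CrossModalAttentionFusion(nn.Module):
    """Hybrid fusion: text tokens attend over audio/video tokens, so the
    weighting of modalities adapts to the input (cross-modal attention)."""
    def __init__(self, d_model=128, n_heads=4, n_classes=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_tokens, av_tokens):
        fused, _ = self.attn(text_tokens, av_tokens, av_tokens)
        return self.classifier(fused.mean(dim=1))

if __name__ == "__main__":
    t, a, v = torch.randn(2, 128), torch.randn(2, 64), torch.randn(2, 32)
    print(EarlyFusion()(t, a, v).shape)   # torch.Size([2, 7])
    print(LateFusion()(t, a, v).shape)    # torch.Size([2, 7])
    text_tok = torch.randn(2, 10, 128)    # toy text token embeddings
    av_tok = torch.randn(2, 20, 128)      # toy audio/video frame embeddings
    print(CrossModalAttentionFusion()(text_tok, av_tok).shape)  # torch.Size([2, 7])
```

The trade-off the abstract describes is visible in the structure: early fusion is simple but forces a fixed joint representation, late fusion keeps modalities independent (and thus interpretable and robust to a missing stream) at the cost of cross-modal interaction, while the attention-based hybrid lets the model weight modalities dynamically per input at higher computational cost.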