TECHNOLOGY FOR IDENTIFICATION OF INFORMATION AGENDA IN NEWS DATA STREAMS

Svitlana Petrasova; Nina Khairova; Anastasiia Kolesnyk

doi:10.20998/2079-0023.2021.01.14

Authors

Svitlana Petrasova National Technical University «Kharkiv Polytechnic Institute», Ukraine https://orcid.org/0000-0001-6011-135X
Nina Khairova National Technical University «Kharkiv Polytechnic Institute», Ukraine https://orcid.org/0000-0002-9826-0286
Anastasiia Kolesnyk National Technical University «Kharkiv Polytechnic Institute», Ukraine https://orcid.org/0000-0001-5817-0844

Abstract

The volume of news data streams is growing, which increases interest in systems that automate the processing of big data streams. Identifying the semantic similarity of text information with intelligent data processing tools makes it possible to single out common information spaces of news. The article analyzes current statistical metrics for identifying coherent fragments, in particular in news texts that reflect the agenda, and identifies their main advantages and disadvantages. An information technology is proposed for identifying the common information space of relevant news in a data stream over a given period of time. The technology includes logical-linguistic and distributive-statistical models for identifying collocations. The MI distributional semantic model is applied at the stage of extracting candidate collocations. In addition, regular expressions developed in accordance with English grammar make it possible to identify grammatically correct constructions. The developed logical-linguistic model formalizes the semantic and grammatical characteristics of collocations using algebraic predicate operations and a semantic equivalence predicate; its advantage is that it analyzes both the grammatical structure of the language and the meaning of the words (collocates). The WordNet thesaurus is used to determine synonymy relations between the main and dependent components of collocations. The effectiveness of the developed technology is assessed on a corpus of news texts from the CNN and BBC services; the analysis shows a precision of 0.96. The proposed technology could improve the quality of news stream processing. Automatic identification of semantic similarity can also be used to identify texts of the same domain, find relevant information, extract facts, and resolve semantic ambiguity.
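
The candidate-extraction stage mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "MI" refers to pointwise mutual information computed over adjacent word pairs, which the abstract does not state explicitly. The function name `pmi_scores`, the `min_count` threshold, and the toy token stream are hypothetical and are not taken from the article; the grammatical filtering with regular expressions and the WordNet synonymy check are not shown.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Assumption: PMI(w1, w2) = log2(P(w1, w2) / (P(w1) * P(w2))),
    with probabilities estimated from raw counts in the token list.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Skip rare pairs: PMI is unreliable for low counts.
        if count < min_count:
            continue
        # Skip pairs that contain punctuation or other non-alphabetic tokens.
        if not (w1.isalpha() and w2.isalpha()):
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy "news stream", already tokenized and lowercased.
tokens = (
    "prime minister visits kyiv . prime minister signs deal . "
    "trade deal boosts markets . prime minister meets press"
).split()

scores = pmi_scores(tokens)
# Candidate collocations, strongest association first.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy stream only the pair ("prime", "minister") occurs often enough to be scored. In a fuller pipeline, such candidates would presumably be passed on to the grammatical and semantic checks that the abstract describes.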

Keywords: data stream, agenda, logical-linguistic model, distribution-statistical model, collocation, semantic similarity, WordNet, news text corpus, precision.

Author Biographies

Svitlana Petrasova, National Technical University «Kharkiv Polytechnic Institute»

candidate of technical sciences, docent, National Technical University «Kharkiv Polytechnic Institute», associate professor of the Department of Intelligent Computer Systems; Kharkiv, Ukraine; ORCID: https://orcid.org/0000-0001-6011-135X; e-mail: svetapetrasova@gmail.com.

Nina Khairova, National Technical University «Kharkiv Polytechnic Institute»

doctor of technical sciences, professor, National Technical University «Kharkiv Polytechnic Institute», professor of the Department of Intelligent Computer Systems; Kharkiv, Ukraine; ORCID: https://orcid.org/0000-0002-9826-0286; e-mail: nina_khajrova@yahoo.com.

Anastasiia Kolesnyk, National Technical University «Kharkiv Polytechnic Institute»

National Technical University «Kharkiv Polytechnic Institute», PhD student of the Department of Intelligent Computer Systems; Kharkiv, Ukraine; ORCID: https://orcid.org/0000-0001-5817-0844; e-mail: kolesniknastya20@gmail.com.

