TECHNOLOGY FOR IDENTIFICATION OF INFORMATION AGENDA IN NEWS DATA STREAMS
The volume of news data streams is growing, which drives interest in systems that automate the processing of big data streams. Identifying the semantic similarity of texts with intelligent data processing tools makes it possible to single out common information spaces in the news. The article analyzes current statistical metrics for identifying coherent fragments, in particular in news texts that reflect the agenda, and identifies their main advantages and disadvantages. An information technology is proposed for identifying the common information space of relevant news in a data stream over a given period of time. The technology combines logical-linguistic and distributive-statistical models for identifying collocations. The MI (mutual information) distributional semantic model is applied at the stage of extracting candidate collocations. At the same time, regular expressions developed in accordance with English grammar make it possible to identify grammatically correct constructions. The developed logical-linguistic model formalizes the semantic-grammatical characteristics of collocations using algebraic-predicate operations and a semantic equivalence predicate; its advantage is that it analyzes both the grammatical structure of the language and the meaning of the words (collocates). The WordNet thesaurus is used to determine the synonymy relationship between the main and dependent components of a collocation. The effectiveness of the developed technology is assessed on a corpus of news texts from the CNN and BBC services. The analysis shows a precision of 0.96. The proposed technology could improve the quality of news stream processing. The solution to the problem of automatic identification of semantic similarity can be used to identify texts of the same domain and relevant information, to extract facts, and to eliminate semantic ambiguity.
Keywords: data stream, agenda, logical-linguistic model, distributive-statistical model, collocation, semantic similarity, WordNet, news text corpus, precision.
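The candidate-extraction stage described in the abstract can be illustrated with a minimal sketch. It assumes the common corpus-linguistic form of the MI measure, MI = log2(f(w1,w2) * N / (f(w1) * f(w2))), applied to adjacent word pairs, and a single illustrative regular expression over Penn Treebank part-of-speech tags. The function names, the tag pattern, the thresholds, and the toy tagged sample are all hypothetical; the article's actual patterns and parameters are not given here, and the WordNet synonymy step is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical grammatical pattern (an assumption, not the article's own):
# an adjective or noun followed by a noun, over Penn Treebank tags.
NOUN_PHRASE = re.compile(r"^(JJ|NN) NN$")

# Toy hand-tagged sample standing in for a news corpus (illustrative only).
TAGGED = [
    ("prime", "JJ"), ("minister", "NN"), ("said", "VBD"), ("the", "DT"),
    ("prime", "JJ"), ("minister", "NN"), ("met", "VBD"), ("the", "DT"),
    ("president", "NN"), ("and", "CC"), ("the", "DT"), ("press", "NN"),
]


def mi_scores(tokens, min_count=1):
    """Return MI scores for adjacent word pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {
        pair: math.log2(f12 * n / (unigrams[pair[0]] * unigrams[pair[1]]))
        for pair, f12 in bigrams.items()
        if f12 >= min_count
    }


def extract_collocations(tagged, pattern=NOUN_PHRASE, min_count=2, min_mi=1.0):
    """Keep pairs that pass both the MI threshold and the tag pattern."""
    words = [word for word, _ in tagged]
    # Remember the tag sequence of the first occurrence of each pair.
    tags_of_pair = {}
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        tags_of_pair.setdefault((w1, w2), f"{t1} {t2}")
    scores = mi_scores(words, min_count)
    kept = [
        pair for pair, mi in scores.items()
        if mi >= min_mi and pattern.match(tags_of_pair[pair])
    ]
    # Strongest associations first.
    return sorted(kept, key=lambda pair: -scores[pair])
```

On the toy sample, the pair ("prime", "minister") occurs twice among 12 tokens, so its score is log2(2 * 12 / (2 * 2)) = log2(6), and it is the only pair that passes both the frequency threshold and the tag pattern. Applying the statistical filter before the grammatical one is a design choice of this sketch, not a claim about the order used in the article.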