TOPIC SEGMENTATION METHODS COMPARISON ON COMPUTER SCIENCE TEXTS

Authors

Volodymyr Sokol, Vitalii Krykun, Mariia Bilova, Ivan Perepelytsya, Volodymyr Pustovarov

DOI:

https://doi.org/10.20998/2079-0023.2021.02.10

Keywords:

topic segmentation, TextTiling, TextSeg, Inspec, IT Companies, computer science texts

Abstract

The demand for information systems that simplify and accelerate work has grown considerably amid the rapid informatization of society and all
its branches. This has led to the emergence of more and more companies that develop software products and information systems in general. Such
companies use knowledge management systems to systematize, process and use their knowledge. One of the main tasks of IT companies is the
continuous training of personnel, which requires exporting content from the company's knowledge management system to a learning management
system. The main goal of this research is to choose an algorithm that solves the problem of marking up the text of articles similar to those used
in the knowledge management systems of IT companies. To achieve this goal, it is necessary to compare various topic segmentation methods on a
dataset of computer science texts. Inspec is one such dataset; it is normally used for keyword extraction, and in this research it has been adapted
to the structure of the datasets used for the topic segmentation problem. The TextTiling and TextSeg methods were compared using several well-known
data science metrics and metrics specific to the topic segmentation problem. A new generalized metric was also introduced to compare results for
the topic segmentation problem. All software implementations of the algorithms were written in the Python programming language as sets of
interrelated functions. The results show that TextSeg outperforms TextTiling both on the classical data science metrics and on the special metrics
developed for the topic segmentation task. From all the metrics, including the newly introduced one, it can be concluded that the TextSeg algorithm
performs better than the TextTiling algorithm on the adapted Inspec test dataset.
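
The abstract does not name the segmentation-specific metrics it uses. As an illustration only, the sketch below assumes two metrics commonly applied to this task, Pk and WindowDiff, and is not the authors' implementation or their new generalized metric. A segmentation is encoded as a string of "0" and "1", where "1" marks a topic boundary after the corresponding text unit; the function names, the encoding and the window size `k` are assumptions made for this example.

```python
def _check(ref: str, hyp: str, k: int) -> int:
    """Validate the inputs and return the number of sliding windows."""
    if len(ref) != len(hyp):
        raise ValueError("reference and hypothesis must have equal length")
    if not 0 < k <= len(ref):
        raise ValueError("window size k must satisfy 0 < k <= len(ref)")
    return len(ref) - k + 1


def pk(ref: str, hyp: str, k: int) -> float:
    """Pk: share of windows where only one of the two segmentations
    contains at least one boundary."""
    windows = _check(ref, hyp, k)
    errors = 0
    for i in range(windows):
        ref_has_boundary = "1" in ref[i:i + k]
        hyp_has_boundary = "1" in hyp[i:i + k]
        if ref_has_boundary != hyp_has_boundary:
            errors += 1
    return errors / windows


def window_diff(ref: str, hyp: str, k: int) -> float:
    """WindowDiff: share of windows where the two segmentations
    contain a different number of boundaries."""
    windows = _check(ref, hyp, k)
    errors = 0
    for i in range(windows):
        if ref[i:i + k].count("1") != hyp[i:i + k].count("1"):
            errors += 1
    return errors / windows


if __name__ == "__main__":
    reference = "0100"   # one boundary, after the second unit
    hypothesis = "0110"  # one correct boundary plus one spurious boundary
    print("Pk:", pk(reference, hypothesis, k=2))
    print("WindowDiff:", window_diff(reference, hypothesis, k=2))
```

Both scores are error rates, so lower values indicate a segmentation closer to the reference. WindowDiff also penalizes windows in which the boundary counts differ, which Pk can miss when both segmentations have at least one boundary in the window.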

Author Biographies

Volodymyr Sokol, National Technical University "Kharkiv Polytechnic Institute"

PhD, Associate Professor, National Technical University «Kharkov Polytechnic Institute», Associate Professor of the Department of Software Engineering and Management Information Technologies; Kharkiv, Ukraine

Vitalii Krykun, National Technical University "Kharkiv Polytechnic Institute"

National Technical University «Kharkov Polytechnic Institute», student; Kharkiv, Ukraine

Mariia Bilova, National Technical University "Kharkiv Polytechnic Institute"

PhD, National Technical University «Kharkov Polytechnic Institute», Associate Professor of the Department of Software Engineering and Management Information Technologies; Kharkiv, Ukraine

Ivan Perepelytsya, National Technical University "Kharkiv Polytechnic Institute"

PhD, National Technical University «Kharkov Polytechnic Institute», Associate Professor of the Department of Software Engineering and Management Information Technologies; Kharkiv, Ukraine

Volodymyr Pustovarov, Kharkiv office of the General Customer - State Space Agency of Ukraine

PhD, group leader, Kharkiv office of the General Customer - State Space Agency of Ukraine; Kharkiv, Ukraine

Published

2021-12-28

How to Cite

Sokol, V., Krykun, V., Bilova, M., Perepelytsya, I., & Pustovarov, V. (2021). TOPIC SEGMENTATION METHODS COMPARISON ON COMPUTER SCIENCE TEXTS. Bulletin of National Technical University "KhPI". Series: System Analysis, Control and Information Technologies, (2 (6)), 59–66. https://doi.org/10.20998/2079-0023.2021.02.10

Section

INFORMATION TECHNOLOGY