TOPIC SEGMENTATION METHODS COMPARISON ON COMPUTER SCIENCE TEXTS
Keywords: topic segmentation, TextTiling, TextSeg, Inspec, IT companies, computer science texts
The demand for information systems that simplify and accelerate work has grown considerably with the rapid informatization of society and all its
branches. This has led to the emergence of more and more companies that develop software products and information systems in general. Knowledge
management systems are used to systematize, process and use the knowledge these companies accumulate. One of the main tasks of IT companies is the
continuous training of personnel, which requires exporting content from the company's knowledge management system to a learning management system.
The main goal of this research is to choose an algorithm for marking up the text of articles similar to those stored in the knowledge management
systems of IT companies. To achieve this goal, various topic segmentation methods must be compared on a dataset of computer science texts. Inspec is
one such dataset; it is normally used for keyword extraction, and in this research it has been adapted to the structure of the datasets used for the
topic segmentation problem. The TextTiling and TextSeg methods were compared using several well-known data science metrics and metrics specific to
the topic segmentation problem. A new generalized metric was also introduced for comparing topic segmentation results. All algorithms were
implemented in Python as sets of interrelated functions. On every metric considered, including the newly introduced one, the TextSeg algorithm
performs better than the TextTiling algorithm on the adapted Inspec test dataset.
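The abstract does not name the segmentation-specific metrics it uses. Two that are widely used for this task are Pk and WindowDiff, so the sketch below shows how such metrics are typically computed. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 0/1 boundary encoding (1 marks a segment boundary after a sentence) and the window size `k` are all chosen here for the example, and Pk is given in the sliding-window form found in common implementations.

```python
from typing import Sequence


def _check(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> None:
    # Both segmentations must describe the same sequence of sentences.
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must have the same length")
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the sequence length")


def pk(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> float:
    """Share of length-k windows in which only one of the two
    segmentations contains at least one boundary."""
    _check(reference, hypothesis, k)
    n_windows = len(reference) - k + 1
    errors = 0
    for i in range(n_windows):
        ref_has_boundary = any(reference[i:i + k])
        hyp_has_boundary = any(hypothesis[i:i + k])
        if ref_has_boundary != hyp_has_boundary:
            errors += 1
    return errors / n_windows


def window_diff(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> float:
    """Share of length-k windows in which the two segmentations
    contain a different number of boundaries."""
    _check(reference, hypothesis, k)
    n_windows = len(reference) - k + 1
    errors = 0
    for i in range(n_windows):
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k]):
            errors += 1
    return errors / n_windows


# Hypothetical example: 8 sentences, gold boundaries after sentences 3 and 6.
gold = [0, 0, 1, 0, 0, 1, 0, 0]
# A predicted segmentation that places the first boundary one sentence late.
predicted = [0, 0, 0, 1, 0, 1, 0, 0]

pk_score = pk(gold, predicted, k=3)
wd_score = window_diff(gold, predicted, k=3)
print(f"Pk = {pk_score:.3f}, WindowDiff = {wd_score:.3f}")
```

Both scores are error rates, so lower is better and 0 means the segmentations agree in every window. In this toy example the near miss costs about 0.17 under Pk and about 0.33 under WindowDiff, because WindowDiff also counts the window in which the prediction contains two boundaries where the gold segmentation has one.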