DEVELOPMENT AND COMPARATIVE ANALYSIS OF SEMI-SUPERVISED LEARNING ALGORITHMS ON A SMALL AMOUNT OF LABELED DATA
The paper presents the development and comparative experimental analysis of semi-supervised learning approaches that combine unsupervised and supervised techniques for classifying datasets with a small amount of labeled data, that is, identifying to which of a set of categories a new observation belongs using a training set of observations whose category membership is known. Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training; unlabeled data, used together with a small quantity of labeled data, can produce a significant improvement in learning accuracy. The goal is to develop and analyze semi-supervised methods and to compare their accuracy and robustness on different synthetic datasets. The first proposed approach is based on the unsupervised K-medoids method, also known as the Partitioning Around Medoids (PAM) algorithm; unlike standard K-medoids, the proposed algorithm first computes the medoids using only the labeled data and then processes the unlabeled points, assigning each one the label of its nearest medoid. The second proposed approach is a mix of the supervised K-nearest-neighbors method and the unsupervised K-means method, so the resulting learning algorithm uses information about both the nearest points and the class centers of mass. The methods were implemented in the Python programming language and experimentally investigated on classification problems using datasets with different distributions and spatial characteristics, generated with the scikit-learn library. The developed approaches were compared by their average accuracy across all of these datasets. The experiments show that even small amounts of labeled data make semi-supervised learning usable, and that the proposed modifications improve both accuracy and algorithm performance.
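The first approach described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: it assumes Euclidean distance and computes one medoid per class from the labeled points only, then labels each unlabeled point by its nearest medoid.

```python
import numpy as np

def semi_supervised_k_medoids(X_labeled, y_labeled, X_unlabeled):
    """Sketch of the semi-supervised K-medoids idea: medoids are computed
    from labeled data only; unlabeled points inherit the label of the
    nearest medoid. Euclidean distance is an illustrative assumption."""
    medoids = {}
    for c in np.unique(y_labeled):
        pts = X_labeled[y_labeled == c]
        # The medoid is the class point minimizing total distance
        # to all other points of the same class.
        dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        medoids[c] = pts[np.argmin(dists.sum(axis=1))]
    classes = np.array(list(medoids.keys()))
    centers = np.stack([medoids[c] for c in classes])
    # Assign each unlabeled point the label of its nearest medoid.
    d = np.linalg.norm(X_unlabeled[:, None, :] - centers[None, :, :], axis=-1)
    return classes[np.argmin(d, axis=1)]
```

Because the medoids are fixed by the labeled data, no iterative reassignment step of classical PAM is needed, which is where the performance gain over unsupervised K-medoids would come from.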
Moreover, the accuracy of the algorithms grows as more label information becomes available, since the developed algorithms use a distance metric that takes the available label information into account.
Keywords: unsupervised learning, supervised learning, semi-supervised learning, clustering, distance, distance function, nearest neighbor, medoid, center of mass.