DEVELOPMENT AND COMPARATIVE ANALYSIS OF SEMI-SUPERVISED LEARNING ALGORITHMS ON A

The paper is dedicated to the development and comparative experimental analysis of semi-supervised learning approaches that combine unsupervised and supervised techniques for classifying datasets with a small amount of labeled data, namely, identifying to which of a set of categories a new observation belongs using a training set of observations whose category membership is known. Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. Unlabeled data, when used together with a small quantity of labeled data, can produce a significant improvement in learning accuracy. The goal is to develop and analyze semi-supervised methods and to compare their accuracy and robustness on different synthetic datasets. The first proposed approach is based on the unsupervised K-medoids method, also known as the Partitioning Around Medoids algorithm; however, unlike K-medoids, the proposed algorithm first calculates medoids using only labeled data and then processes the unlabeled points, assigning each the label of the nearest medoid. The second proposed approach mixes the supervised K-nearest-neighbors method with the unsupervised K-means, so the resulting learning algorithm uses information about both the nearest points and the classes' centers of mass. The methods have been implemented in the Python programming language and experimentally investigated on classification problems using datasets with different distributions and spatial characteristics, generated with the scikit-learn library. The developed approaches were compared by their average accuracy on all these datasets. It was shown that even small amounts of labeled data make semi-supervised learning applicable, and the proposed modifications improve accuracy and algorithm performance, as demonstrated in the experiments.
Moreover, the accuracy of the algorithms grows as more label information becomes available, since the developed algorithms use a distance metric that takes the available label information into account.


Introduction.
A large amount of data has been produced recently, and nowadays humanity has the opportunity to store and process all of it. In all spheres of life, people try to use these data to optimize business and improve life using AI and data mining.
There are several approaches to data processing and analysis within the framework of machine learning paradigms. One of them is unsupervised learning [1], in which we try to detect inner structure or patterns without human supervision. The most efficient approach in machine learning is supervised learning, in which we have some labeled data and try to learn a function on point-label pairs. In many cases there is no opportunity to label all the data: the causes are experiments that are too complex and expensive, data streaming at high frequency [2], or simply the high cost of data labeling. In such cases a satisfactory compromise is semi-supervised learning, in which we use datasets with a small amount of labeled data, which allows learning the inner structure better (fig. 1).
Fig. 1. Example of unlabeled data in semi-supervised learning
Semi-supervised learning includes different approaches and can be used for any popular data analysis problem, such as clustering [3], anomaly detection, or latent variable models.
The object of the study is the process of the data points classifications, namely, identifying to which of a set of categories a new observation belongs using a training set of data containing observations whose category membership is known.
The subject of the study is the development of semi-supervised methods for data classification.
The purpose of the work is to develop an improved semi-supervised method using already existing supervised and unsupervised approaches and to compare their accuracy and robustness.
Problem statement. Given a set of labeled examples {⟨x₁, y₁⟩, …, ⟨x_l, y_l⟩}, where x_i is the feature vector of the i-th example and y_i its label (class), and a set of unlabeled data {x_{l+1}, …, x_{l+u}}, with x₁, x₂, …, x_{l+u} ∈ X and y₁, y₂, …, y_l ∈ Y. The goal is to determine, using the given sets, a function f that correctly maps points from X to Y: f(x_i) = y_i for any point x_i from X.
Related work. Semi-supervised learning is described in the literature not as widely as unsupervised or supervised learning, especially with respect to algorithm implementations.
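The problem setup above can be sketched in a few lines of Python; the names and the random data below are purely illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

l, u, m = 25, 225, 2               # labeled count, unlabeled count, features
X = rng.normal(size=(l + u, m))    # feature vectors x_1 .. x_{l+u}
y = rng.integers(0, 3, size=l)     # labels y_1 .. y_l known only for first l points

X_labeled, X_unlabeled = X[:l], X[l:]
print(X_labeled.shape, X_unlabeled.shape)  # -> (25, 2) (225, 2)
```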
In [4], Jesper E. van Engelen and Holger H. Hoos give an overview of semi-supervised approaches and describe the assumptions of semi-supervised learning: the smoothness, low-density, and manifold assumptions.
The semi-supervised approach demonstrates high efficiency in solving clustering problems; the idea of using clustering algorithms was described in the review [5]. The majority of these methods are modifications of the popular k-means clustering method.
One of the simplest unsupervised approaches is K-medoids, also known as the Partitioning Around Medoids algorithm, proposed in 1987 by Kaufman and Rousseeuw [6]. A medoid is a point in the cluster whose average dissimilarity to all the other points in the cluster is minimal.
K-medoids is a partitioning technique of clustering that splits a data set of n objects into k clusters, with the number k of clusters assumed to be known a priori.
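The medoid definition above can be sketched directly in Python; the function name `medoid` is illustrative:

```python
import numpy as np

def medoid(points: np.ndarray) -> np.ndarray:
    """Return the point whose summed distance to all other points is minimal."""
    # Pairwise Euclidean distances between all points in the cluster.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
print(medoid(cluster))  # -> [1. 0.]
```

Note that, unlike the mean, the medoid is always an actual data point, so an outlier such as [10, 10] cannot drag the cluster center outside the data.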
Both the k-means and k-medoids algorithms are partitional, i.e., they break the dataset up into groups, and both attempt to minimize the distance between the points assigned to a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers and can be used with arbitrary distances, while in k-means the center of a cluster is the average of the points in the cluster (fig. 2). Consequently, K-medoids is more robust to noise and outliers than K-means.

Fig. 2. Mean and medoid difference
The supervised approach is described in [7]. The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points. For any number of categories, the probability of error of the nearest neighbor rule is bounded above by twice the Bayes probability of error; in this sense, it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.
Вісник Національного технічного університету «ХПІ». Серія: Системний аналіз, управління та інформаційні технології, № 1 (5) 2021
One of the popular families of semi-supervised methods is kernel-based methods [8], especially transductive support vector machines [9, 10]. This method has the same pros and cons as the classic Support Vector Machine, but its main drawbacks are that the algorithm works only with binary classification and its computation time grows exponentially as the data set increases.
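The nearest neighbor decision rule from [7] can be sketched as a one-line argmin over distances; the helper name `nn_classify` and the toy points are illustrative:

```python
import numpy as np

def nn_classify(X_labeled: np.ndarray, y_labeled: np.ndarray, x: np.ndarray):
    """1-NN rule: give the query point x the label of its nearest labeled point."""
    dists = np.linalg.norm(X_labeled - x, axis=1)
    return y_labeled[dists.argmin()]

X_labeled = np.array([[0.0, 0.0], [5.0, 5.0]])
y_labeled = np.array([0, 1])
print(nn_classify(X_labeled, y_labeled, np.array([4.0, 4.5])))  # -> 1
```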
Semi-supervised methods. As a baseline, clustering algorithms implemented in the scikit-learn library [11, 12] were chosen. The algorithms use different approaches, and the library has interfaces for using custom metrics.
The proposed algorithm uses the K-medoids approach as its base idea. However, unlike K-medoids, the proposed algorithm first calculates medoids using only labeled data and then processes the unlabeled points, assigning each the label of the nearest medoid.
This algorithm has the following pros: reduced processing time, because it requires only a single pass through the points, unlike the repeated iterations of standard K-medoids; more robustness to wrongly assigned labels, because the algorithm gives higher weight to labeled data in the medoid calculation step.
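The K-medoids-based modification described above can be sketched as follows; the function name and the convention that y[i] = -1 marks an unlabeled point follow the later pseudocode, while the rest is a minimal illustrative implementation, not the authors' exact code:

```python
import numpy as np

def semi_supervised_medoids(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Compute one medoid per class from labeled points only (y[i] == -1 marks
    an unlabeled point), then give every unlabeled point the label of its
    nearest class medoid."""
    y_pred = y.copy()
    classes = np.unique(y[y != -1])
    medoids = []
    for c in classes:
        pts = X[y == c]
        # medoid = labeled point of class c with minimal summed distance
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        medoids.append(pts[d.sum(axis=1).argmin()])
    medoids = np.array(medoids)
    for i in np.where(y == -1)[0]:
        y_pred[i] = classes[np.linalg.norm(medoids - X[i], axis=1).argmin()]
    return y_pred
```

Because the medoids are computed once from the labeled subset, the assignment step is a single pass over the unlabeled points.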
Another proposed approach uses the ideas of the K-nearest-neighbors and K-means algorithms: for classification we use information about both the nearest points and the classes' centers of mass (algorithm 1).
Euclidean distance was used as the distance metric, but any metric could be used.
The classes' centers are not recalculated after each assignment, because experiments show that this does not improve results but takes more computation time.
So, the method described above allows us to: consider information about the nearest points, because in most cases a point has the same label as its neighbors; combine different kinds of information; tune the weight of the different sources using input parameters.
Experiments. For the experiments, multiple datasets were generated using the sklearn library. Each dataset contains 250 points in 2D space. By default, only 10% of the labels are available. In addition, the datasets have multiple clusters with different distributions and shapes (fig. 3).
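The experimental setup described above can be reproduced with scikit-learn's dataset generators; the specific generator, seed, and masking scheme below are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs  # make_moons etc. give other shapes

rng = np.random.default_rng(42)

# 250 points in 2D with several clusters, as in the experiments.
X, y = make_blobs(n_samples=250, centers=3, random_state=42)

# Hide 90% of the labels: -1 marks an unlabeled point.
y_semi = y.copy()
hidden = rng.choice(len(y), size=int(0.9 * len(y)), replace=False)
y_semi[hidden] = -1

print((y_semi != -1).sum())  # -> 25 labeled points remain
```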
We will compare the different approaches by their average accuracy on all these datasets. Tab. 1 shows that the best-performing method is the K-nearest-neighbors-based algorithm, which has the highest average accuracy. Fig. 4 shows the same result. In particular, the K-nearest-neighbors-based approach has better accuracy in the case of closely located clusters with the same distribution.
Another required feature of a semi-supervised algorithm is the dependency of quality on the number of labels: more labels, higher quality, and vice versa. Fig. 5 shows that the proposed methods indeed become more accurate as the number of available labels increases.

Algorithm 1. Object classification using the K-NN based approach
Input:
X — feature matrix n*m, n — number of objects, m — number of features
y — labels vector of length n, y[i] = -1 if there is no label for the i-th object
K — number of nearest points
C — weight of the nearest class center
Output:
y_predicted — vector of length n with object labels
1: y_predicted ← empty list of length n
2: unlabeled_idxs ← list of indexes where y = -1
3: labeled_idxs ← list of indexes where y > -1
4: center_coordinates ← list of center coordinates for each class, calculated using the available labels
5: randomly shuffle unlabeled_idxs
6: for i in unlabeled_idxs do
7:   distances_i ← distances from the i-th object to each object with indexes in labeled_idxs
8:   argsort distances_i
9:   nearest_idxs ← indexes of the first K elements of distances_i
10:  classes_dist_i ← distance from the i-th object to each class center
11:  nearest_class_idx ← index of the class nearest to the i-th object
12:  cls_counts ← list whose j-th element denotes the number of points belonging to the j-th class among nearest_idxs
13:  cls_counts[nearest_class_idx] ← cls_counts[nearest_class_idx] + C
14:  y_predicted[i] ← index of the maximum element of cls_counts
15: end for

Conclusions. In this study, we have shown that even small amounts of labeled data allow the use of semi-supervised learning and improve accuracy. In addition, semi-supervised learning can improve algorithm performance too. Multiple approaches to semi-supervised learning were proposed; they use a distance metric that considers the available label information.
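Algorithm 1 can be sketched in Python as follows. Since the final scoring step is only partially specified in the source, the sketch assumes that the nearest class center simply contributes an extra vote of weight C on top of the K-nearest-neighbor counts; the function name is illustrative:

```python
import numpy as np

def knn_center_classify(X: np.ndarray, y: np.ndarray, K: int = 3, C: float = 1.0):
    """Sketch of Algorithm 1. y[i] == -1 marks an unlabeled object. Each
    unlabeled object is classified by the votes of its K nearest labeled
    neighbors plus an extra vote of weight C for the nearest class center
    (the combination step is an assumption, see the text)."""
    y_pred = y.copy()
    labeled = np.where(y != -1)[0]
    unlabeled = np.where(y == -1)[0]
    classes = np.unique(y[labeled])
    # Class centers of mass, computed once from the available labels
    # and not recalculated afterwards.
    centers = np.array([X[y == c].mean(axis=0) for c in classes])
    np.random.default_rng(0).shuffle(unlabeled)                # step 5
    for i in unlabeled:
        d = np.linalg.norm(X[labeled] - X[i], axis=1)          # step 7
        nearest = labeled[np.argsort(d)[:K]]                   # steps 8-9
        votes = np.array([(y[nearest] == c).sum() for c in classes],
                         dtype=float)                          # step 12
        nearest_class = np.linalg.norm(centers - X[i], axis=1).argmin()
        votes[nearest_class] += C   # weight of the nearest class center
        y_pred[i] = classes[votes.argmax()]
    return y_pred
```

With C = 0 the rule reduces to plain K-NN on the labeled subset, and with large C it approaches nearest-center (K-means-style) assignment, which is how the two weights can be tuned against each other.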