ADAPTATION OF LAMBDAMART MODEL TO SEMI-SUPERVISED LEARNING
DOI: https://doi.org/10.20998/2079-0023.2023.01.12

Keywords: learning to rank, information retrieval, semi-supervised learning, pairwise ranking, LambdaMART, pseudo labeling, NDCG

Abstract
The problem of information search is ubiquitous in the age of the internet and Big Data. Collections typically contain huge numbers of documents, of which only a few percent are relevant, so brute-force methods are useless. Search engines solve this problem efficiently. Most engines are based on learning-to-rank methods: the algorithm first produces a score for each document based on its features, and then sorts the documents by score in the appropriate order. Many algorithms exist in this area, but one of the fastest and most robust ranking algorithms is LambdaMART. This algorithm is based on boosting and was developed only for supervised learning, where each document in the collection has a rank assigned by an expert. However, collections in this area usually contain enormous numbers of documents, and annotating them requires substantial resources: time, money, experts, etc. In this case, semi-supervised learning is a powerful approach. Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training; used together with a small quantity of labeled data, the unlabeled data can produce a significant improvement in learning accuracy. This paper is dedicated to the adaptation of LambdaMART to semi-supervised learning. The author proposes assigning different weights to labeled and unlabeled data during the training procedure to achieve higher robustness and accuracy. The proposed algorithm was implemented in the Python programming language using the LightGBM framework, which already provides a supervised implementation of LambdaMART. For testing purposes, multiple datasets were used: one synthetic 2D dataset for a visual explanation of the results, and two real-world datasets, MSLR-WEB10K by Microsoft and Yahoo LTRC.
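The two ingredients named in the abstract, the NDCG quality measure and the weighting of labeled versus pseudo-labeled documents, can be sketched in a few lines of plain Python. This is an illustrative sketch only: the `ndcg_at_k` helper implements the standard exponential-gain NDCG formula, and the `PSEUDO_WEIGHT` value of 0.3 is an assumed placeholder, not the paper's tuned setting or its LightGBM-based implementation.

```python
import math

def ndcg_at_k(labels, scores, k):
    """NDCG@k: rank documents by predicted score, then compare the resulting
    discounted cumulative gain (DCG) against the ideal, label-sorted ordering."""
    # Indices of documents sorted by predicted score, best first.
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    dcg = sum((2 ** labels[i] - 1) / math.log2(pos + 2)
              for pos, i in enumerate(order[:k]))
    # Ideal DCG: documents sorted by their true relevance labels.
    ideal = sorted(labels, reverse=True)
    idcg = sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Weighting sketch for semi-supervised training: expert-labeled documents
# keep full weight, pseudo-labeled ones are down-weighted. The 0.3 here is
# a hypothetical value for illustration, not the paper's choice.
PSEUDO_WEIGHT = 0.3
n_labeled, n_pseudo = 4, 6
sample_weight = [1.0] * n_labeled + [PSEUDO_WEIGHT] * n_pseudo
```

In a LightGBM-based setup, such a `sample_weight` vector would be passed to the ranker's fitting routine so that noisy pseudo-labels influence the boosting gradients less than expert labels.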
License
This work is licensed under a Creative Commons Attribution 4.0 International License.