PRIVACY MODELS AND ANONYMIZATION TECHNIQUES FOR TABULAR HEALTHCARE DATA

Authors

Denys Kalinin, Valerii Severyn, Mykola Bezmenov

DOI:

https://doi.org/10.20998/2079-0023.2024.02.12

Keywords:

healthcare data, tabular data, data anonymization, privacy models, k-anonymity, l-diversity, t-closeness, data anonymization techniques, differential privacy

Abstract

In today's world, privacy and personal data protection have become critically important, especially in healthcare, where large volumes of data are increasingly used for research. The use of personal data is regulated by laws that require data anonymization to minimize the risk of identifying individuals. Anonymization is a process that allows sensitive data to be used without the risk of disclosing personal information while preserving its utility. This article reviews the main privacy models and anonymization techniques used to protect tabular healthcare data.

The privacy models considered are k-anonymity, l-diversity, and t-closeness. The k-anonymity model ensures that any combination of quasi-identifiers is shared by at least k records. The l-diversity model strengthens k-anonymity by requiring at least l distinct, well-represented sensitive attribute (SA) values in each equivalence class. The t-closeness model additionally constrains the distribution of SA values, requiring that the distance between the SA distribution within an equivalence class and the SA distribution in the whole dataset not exceed a specified threshold t.

The anonymization techniques considered are generalization, suppression, relocation, permutation, perturbation, slicing, differential privacy, and synthetic data generation. Generalization reduces the precision of quasi-identifiers. Suppression removes individual values or records, typically outliers, so that the remaining data can be anonymized with less distortion. Relocation changes a limited number of values in the data to enhance protection. Permutation shuffles quasi-identifier values between records while preserving the overall statistical features of the dataset. Perturbation adds noise directly to the data. Slicing partitions the table into groups of attributes and buckets of records, breaking the link between quasi-identifiers and sensitive values. Differential privacy also relies on added noise, but applies it at the query processing stage rather than to the stored records. Synthetic data generation creates new datasets whose statistical characteristics resemble those of the original data.
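To make the three privacy models concrete, the following Python sketch (an illustration, not code from the article) checks k-anonymity, distinct l-diversity, and a simplified t-closeness on a toy table using pandas. Column names and records are invented, and total variation distance stands in for the Earth Mover's Distance used in the original t-closeness definition.

    # Toy check of k-anonymity, distinct l-diversity, and simplified t-closeness.
    # All data and column names are invented for illustration.
    import pandas as pd

    df = pd.DataFrame({
        "age_group":  ["30-39", "30-39", "30-39", "40-49", "40-49", "40-49"],
        "zip_prefix": ["610**", "610**", "610**", "611**", "611**", "611**"],
        "diagnosis":  ["flu", "flu", "asthma", "diabetes", "flu", "diabetes"],
    })
    QI, SA = ["age_group", "zip_prefix"], "diagnosis"
    groups = df.groupby(QI)

    k = groups.size().min()         # k-anonymity: size of the smallest equivalence class
    l = groups[SA].nunique().min()  # distinct l-diversity: fewest distinct SA values per class

    # Simplified t-closeness: total variation distance between each class's SA
    # distribution and the overall SA distribution (the paper uses EMD).
    overall = df[SA].value_counts(normalize=True)
    t = max(
        (g[SA].value_counts(normalize=True)
              .reindex(overall.index, fill_value=0) - overall).abs().sum() / 2
        for _, g in groups
    )
    print(f"k = {k}, l = {l}, t = {t:.2f}")  # here: k = 3, l = 2, t = 0.33

Here every equivalence class has three records (k = 3) and at least two distinct diagnoses (l = 2); lowering the threshold t below 0.33 would force each class to mirror the overall diagnosis distribution more closely.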
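The next sketch illustrates generalization and suppression under the same assumptions (invented data; the decade binning for age and the 3-digit ZIP prefix are arbitrary choices, not prescriptions from the article):

    # Generalization: coarsen quasi-identifiers. Suppression: drop records whose
    # equivalence class stays too small. Data and thresholds are invented.
    import pandas as pd

    df = pd.DataFrame({
        "age": [34, 37, 38, 44, 47, 92],
        "zip": ["61002", "61003", "61005", "61102", "61103", "61104"],
        "diagnosis": ["flu", "flu", "asthma", "diabetes", "flu", "diabetes"],
    })

    # Generalization: age -> decade bin, ZIP code -> 3-digit prefix.
    df["age"] = df["age"].apply(lambda a: f"{a // 10 * 10}-{a // 10 * 10 + 9}")
    df["zip"] = df["zip"].str[:3] + "**"

    # Suppression: remove records in equivalence classes smaller than k,
    # e.g. the single 92-year-old, who would otherwise stay easy to re-identify.
    k = 2
    sizes = df.groupby(["age", "zip"])["age"].transform("size")
    df = df[sizes >= k].reset_index(drop=True)
    print(df)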
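Permutation and perturbation can be sketched in the same spirit; the Gaussian noise scale below is an arbitrary assumption, not a recommendation:

    # Permutation: shuffle a quasi-identifier column between records.
    # Perturbation: add zero-mean noise to a numeric attribute.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=0)
    df = pd.DataFrame({
        "age": [34, 37, 44, 47, 58, 61],
        "weight_kg": [70.0, 82.5, 64.0, 90.0, 77.5, 85.0],
    })

    # Permutation keeps the marginal distribution of "age" intact but breaks
    # its link to the rest of each record.
    df["age"] = rng.permutation(df["age"].to_numpy())

    # Perturbation masks individual values while approximately preserving
    # aggregates such as the column mean.
    df["weight_kg"] = df["weight_kg"] + rng.normal(0.0, 2.0, size=len(df))
    print(df.round(1))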
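Differential privacy, by contrast, perturbs query answers rather than the stored records. A minimal sketch of the Laplace mechanism for a counting query follows; the epsilon values and toy data are assumptions for illustration:

    # Laplace mechanism: a counting query has sensitivity 1 (adding or removing
    # one person changes the count by at most 1), so Laplace noise with scale
    # 1/epsilon yields epsilon-differential privacy for the answer.
    import numpy as np

    rng = np.random.default_rng(seed=0)
    ages = np.array([34, 37, 38, 44, 47, 92])

    def dp_count(mask: np.ndarray, epsilon: float) -> float:
        true_count = int(mask.sum())
        return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

    # How many patients are older than 40? Smaller epsilon means more noise
    # (stronger privacy); larger epsilon means a more accurate answer.
    print(dp_count(ages > 40, epsilon=0.5))
    print(dp_count(ages > 40, epsilon=2.0))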
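Finally, a deliberately naive synthetic-data sketch: each column is resampled from its own fitted or empirical distribution, ignoring cross-column correlations, whereas the generative models cited in the references (VAEs, GANs) learn the joint distribution. The data and the normal-fit assumption are invented for illustration:

    # Naive per-column synthesis: numeric columns from a fitted normal,
    # categorical columns from observed frequencies. Correlations are ignored.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=0)
    real = pd.DataFrame({
        "age": [34, 37, 38, 44, 47, 58],
        "diagnosis": ["flu", "flu", "asthma", "diabetes", "flu", "diabetes"],
    })

    n = 10  # number of synthetic records
    probs = real["diagnosis"].value_counts(normalize=True)
    synthetic = pd.DataFrame({
        "age": rng.normal(real["age"].mean(), real["age"].std(), size=n)
                  .round().astype(int),
        "diagnosis": rng.choice(probs.index.to_numpy(), size=n, p=probs.to_numpy()),
    })
    print(synthetic)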

Author Biographies

Denys Kalinin, National Technical University "Kharkiv Polytechnic Institute"

Postgraduate Student of the Department of System Analysis and Information-Analytical Technologies, National Technical University "Kharkiv Polytechnic Institute", Kharkiv, Ukraine

Valerii Severyn, National Technical University "Kharkiv Polytechnic Institute"

Doctor of Technical Sciences, Professor, Professor of the Department of System Analysis and Information-Analytical Technologies, National Technical University "Kharkiv Polytechnic Institute", Kharkiv, Ukraine

Mykola Bezmenov, National Technical University "Kharkiv Polytechnic Institute"

Candidate of Technical Sciences (PhD), Docent, Professor of the Department of System Analysis and Information-Analytical Technologies, National Technical University "Kharkiv Polytechnic Institute", Kharkiv, Ukraine

References

Document 32016R0679. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (Text with EEA relevance). EUR-Lex. Access to European Union law. URL: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32016R0679 (accessed 30.10.2024).

Summary of the HIPAA Privacy Rule. USA: United States Department of Health and Human Services. URL: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html (accessed 30.10.2024).

Sweeney L. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 2002, vol. 10, is. 5, pp. 557–570. URL: https://epic.org/wp-content/uploads/privacy/reidentification/Sweeney_Article.pdf (accessed 30.10.2024).

Machanavajjhala A., Kifer D., Gehrke J., Venkitasubramaniam M. L-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data. 2007, vol. 1, is. 1, article 3.

Li N., Li T., Venkatasubramanian S. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. IEEE 23rd International Conference on Data Engineering. 2007, pp. 106–115. URL: https://ieeexplore.ieee.org/document/4221659 (accessed 01.11.2024).

Majeed A., Ullah F., Lee S. Vulnerability- and Diversity-Aware Anonymization of Personally Identifiable Information for Improving User Privacy and Utility of Publishing Data. Sensors. 2017, vol. 17, is. 5, article 1059. URL: https://www.mdpi.com/1424-8220/17/5/1059 (accessed 01.11.2024).

Samarati P., Sweeney L. Protecting Privacy when Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression. Technical report, SRI International. 1998. URL: https://www.semanticscholar.org/paper/Protecting-privacy-when-disclosing-information%3A-and-Samarati-Sweeney/7df12c498fecedac4ab6034d3a8032a6d1366ca6 (accessed 30.10.2024).

Nergiz M. E., Gök M. Z. Hybrid k-anonymity. Computers & Security. 2014, vol. 44, pp. 51–63. URL: https://www.researchgate.net/publication/261139007_Hybrid_k-Anonymity (accessed 30.10.2024).

Martin D. J., Kifer D., Machanavajjhala A., Gehrke J., Halpern J. Y. Worst-Case Background Knowledge for Privacy-Preserving Data Publishing. IEEE 23rd International Conference on Data Engineering. 2007, pp. 126–135. URL: https://www.academia.edu/1520959/Worst_Case_Background_Knowledge_for_Privacy_Preserving_Data_Publishing (accessed 30.10.2024).

Domingo-Ferrer J., Mateo-Sanz J. M. Practical Data-Oriented Microaggregation for Statistical Disclosure Control. IEEE Transactions on Knowledge and Data Engineering. 2002, vol. 14, no. 1, pp. 189–201.

Gal T. S., Tucker Th. C., Gangopadhyay A., Chen Z. A data recipient centered de-identification method to retain statistical attributes. Journal of Biomedical Informatics. 2014, vol. 50, pp. 32–45.

Li T., Li N., Zhang J., Molloy I. Slicing: A New Approach for Privacy Preserving Data Publishing. IEEE Transactions on Knowledge and Data Engineering. 2012, vol. 24, is. 3, pp. 561–574.

Dwork C. Differential Privacy. Automata, Languages and Programming. 2006, pp. 1–12.

Abadi M., Chu A., Goodfellow I., McMahan H. B., Mironov I., Talwar K., Zhang L. Deep Learning with Differential Privacy. CCS'16: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 2016, pp. 308–318.

Kingma D. P., Welling M. Auto-Encoding Variational Bayes. arXiv. 2013, article 1312.6114. URL: https://www.semanticscholar.org/reader/5f5dc5b9a2ba710937e2c413b37b053cd673df02 (accessed 01.11.2024).

Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair Sh., Courville A., Bengio Y. Generative Adversarial Networks. arXiv. 2014, article 1406.2661. URL: https://dl.acm.org/doi/pdf/10.1145/3422622 (accessed 01.11.2024).

Published

2025-01-04

How to Cite

Kalinin, D., Severyn, V., & Bezmenov, M. (2025). PRIVACY MODELS AND ANONYMIZATION TECHNIQUES FOR TABULAR HEALTHCARE DATA. Bulletin of National Technical University "KhPI". Series: System Analysis, Control and Information Technologies, (2 (12)), 81–85. https://doi.org/10.20998/2079-0023.2024.02.12

Issue

Section

INFORMATION TECHNOLOGY